Preface


This book is not about statistical theory, nor is it meant to teach R
programming. It is intended for readers who know the basics of R but
find themselves with problems or situations that are commonly
encountered by newcomers to R, and for readers who want to see
compact examples of typical statistical analyses. In other words, if
you understand basic statistics and already know a bit about R, then
this book is for you.

R has rapidly become the lingua franca of statistical computing; it is
free statistical programming software and can be downloaded from
http://cran.r-project.org. Many newcomers to R are intimidated by the
command-line interface, by the sheer number of functions and packages,
or by just trying to figure out how to import data and perform a
simple statistical analysis.

The book consists of a number of examples that each illustrate a
specific situation, topic or problem, ranging from data import through
data management and classical statistical analyses to graphics. Each
example is self-contained and provides R code that can be run exactly
as shown, so the results from the book can be reproduced. The only
change (barring simulated data, machine set-up and small tweaks to
make figures suitable for printing) is that some of the output lines
have been removed for brevity.

This is not a "missing manual" or a thorough exploration of the
functions used. Instead of trying to cover every possible option or
special case that might be of interest, we focus on the common
situations that most beginning users are likely to encounter. Thus we
concentrate on the basics of getting things done and on giving
examples that can be used as a starting point for the reader, rather
than exploring the multitude of options available with every command
and the ever-increasing number of packages. Most problems can be
solved in more than one way, and this is particularly true for a
programming language like R. Here, I have provided a single solution
to most problems and have tried to use base R if at all possible. If
other functions or packages cover or extend the same functionality,
then some of them are listed at the end of each example.

The R list of frequently asked questions is highly recommended and 
covers a few of the same topics mentioned here. However, it does not 
cover examples of statistical analyses and it rarely covers some of the 
most basic problems new users encounter. 

Base graphics are used throughout the book. More advanced graphics
can be produced with the recent lattice and ggplot2 packages
(see Sarkar (2008), Wickham (2009), or Murrell (2011) for further
information on advanced R graphics). A more complete coverage of R
and/or statistics can be found in the books by Venables and Ripley
(2002), Verzani (2005), Crawley (2007), Dalgaard (2008), and Everitt and
Hothorn (2010). These books have a slightly different target audience
than the present text and are all highly recommended.

The R Primer has a supporting web site at 

http://www.statistics.life.ku.dk/primer/ 

where additional topics are covered and where the R code used in the 
book can be found. 

I would like to thank all R developers and package writers for the
enormous work they have done and continue to put into the R program
and its extensions. I appreciate all the helpful responses to my
enquiries and suggestions. I am grateful to my colleagues at the
Faculty of Life Sciences, University of Copenhagen, as well as Klaus K.
Holst, Duncan Temple Lang, and Bendix Carstensen for their ideas,
comments, suggestions and encouragement at various stages of the
manuscript. Many thanks to Tina Ekstrom for once again creating a
wonderful cover, and last, but not least, thanks to Marlene, Ellen and
Anna for bearing with me through yet another book.

Claus Thorn Ekstrom 
Frederiksberg 2011 
Chapter 1

Importing data 


1.1 Read data from a text file 

Problem: You want to import a dataset stored in an ASCII text file. 

Solution: Data stored in simple text files can be read into R using the
read.table function. By default, the observations should be listed
in columns where the individual fields are separated by one or more
white space characters, and where each line in the file corresponds to
one row of the data frame. The columns do not need to be aligned or
formatted, but multi-word observations like high income need to be
put in quotes or combined into a single word. Assume we have a text
file, mydata.txt, with the following contents:


acid   digest   name
30.3   70.6     NA
29.8   67.5     Eeny
NA     87.0     Meeny
4.1    89.9     Miny
4.4    .        Moe
2.8    93.1     ""
3.8    96.7     ""


which we read into R with the following command:

> indata <- read.table("mydata.txt", header=TRUE)
> indata
  acid digest  name
1 30.3   70.6  <NA>
2 29.8   67.5  Eeny
3   NA   87.0 Meeny
4  4.1   89.9  Miny
5  4.4      .   Moe
6  2.8   93.1
7  3.8   96.7


The first argument is the name of the data file, and the second
argument (header=TRUE) is optional and should be used only if the first
line of the text file provides the variable names. If the first line does
not contain the column names, the variables will be labeled
consecutively V1, V2, V3, etc. Each line in the input file must contain
the same number of columns for read.table to work. The sep option should
be included to indicate which character separates the columns if they
are separated by characters other than white space. For example,
if the columns are separated by tabs then we can use sep="\t". Data
read with read.table are stored as a data frame within R.
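As a small illustration of the header and sep options, the sketch below
writes a tab-separated file without a header line to a temporary
location and reads it back. The file contents and the use of tempfile()
are assumptions made for this example and are not part of the dataset
used in the text.

```r
# Hypothetical tab-separated file without a header line
path <- tempfile(fileext = ".txt")
writeLines(c("30.3\t70.6\tEeny",
             "29.8\t67.5\tMeeny"), path)

# No header line is present, so the columns are labeled V1, V2, V3
indata <- read.table(path, sep = "\t")
names(indata)
class(indata)
```

Leaving out header=TRUE is the safe choice whenever the first line
holds data rather than variable names.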

The default code for missing observations is the character string NA,
which we can see works for both the first and third observations above
(acid is read as a numeric vector and name as a factor). Empty
character fields are scanned as empty character vectors, unless the
option na.strings contains the value "", in which case they become
missing values. Empty numeric fields (for example if the columns are
separated by tabs) are automatically considered missing.

> indata <- read.table("mydata.txt", header=TRUE,
+                      na.strings=c("NA", ""))
> indata
  acid digest  name
1 30.3   70.6  <NA>
2 29.8   67.5  Eeny
3   NA   87.0 Meeny
4  4.1   89.9  Miny
5  4.4      .   Moe
6  2.8   93.1  <NA>
7  3.8   96.7  <NA>


Note that due to the period '.' for observation 5, R considers the
variable digest as a factor and not numeric since the period is read as
a character string. If periods should also be considered missing values
we need to include that in na.strings.

> indata <- read.table("mydata.txt", header=TRUE,
+                      na.strings=c("NA", "", "."))
> indata
  acid digest  name
1 30.3   70.6  <NA>
2 29.8   67.5  Eeny
3   NA   87.0 Meeny
4  4.1   89.9  Miny
5  4.4     NA   Moe
6  2.8   93.1  <NA>
7  3.8   96.7  <NA>


R looks for the file mydata.txt in the current working directory, but
the full path can be specified in the call to read.table, e.g.,

> indata <- read.table("d:/mydata.txt", header=TRUE) 

See Rule 5.12 on how to change the current working directory. 
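The steps in this rule can be tried without creating mydata.txt by
hand. The sketch below writes the example lines to a temporary file (the
path returned by tempfile() is an assumption of this example) and reads
them back with NA, empty strings and periods all treated as missing.
Note that from R version 4.0.0 character columns are read as character
vectors rather than factors unless stringsAsFactors=TRUE is given.

```r
# Write the example data to a temporary file (hypothetical location)
path <- tempfile(fileext = ".txt")
writeLines(c('acid digest name',
             '30.3 70.6 NA',
             '29.8 67.5 Eeny',
             'NA 87.0 Meeny',
             '4.1 89.9 Miny',
             '4.4 . Moe',
             '2.8 93.1 ""',
             '3.8 96.7 ""'), path)

# Treat NA, empty strings and periods as missing values
indata <- read.table(path, header = TRUE,
                     na.strings = c("NA", "", "."))
str(indata)  # digest is numeric once "." is read as missing
```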
1.2 Read data from a simple XML file

Problem: You want to import a dataset stored as a simple structure in 
the XML file format. 

Solution: XML (eXtensible Markup Language) was designed to
transport and store data, and it has seen widespread use for
exchanging data over the Internet.

An XML file consists of a series of elements which form a document
tree. The tree starts at the root and branches to the lowest level of the
tree. XML documents must contain a root node (or element) which is
"the parent" of all other nodes, and all nodes can have their own
sub-nodes ("child elements").

An example XML file is shown below where the data from the
trees dataset are stored in XML format. The root node <document>
has several child nodes (the <row> nodes) and each row has its own
child elements corresponding to the variables in the data frame and
their values.

<?xml version="1.0"?>
<document>
  <row>
    <Girth>8.3</Girth>
    <Height>70</Height>
    <Volume>10.3</Volume>
  </row>
  <row>
    <Girth>8.6</Girth>
    <Height>65</Height>
    <Volume>10.3</Volume>
  </row>
  ...
    <Volume>77</Volume>
  </row>
</document>

The XML package provides numerous tools for parsing and generating
XML in R. Since XML is such a flexible format, the XML package
primarily consists of functions that must be combined to parse and
extract information from a specific type of XML structure.

XML document files with a simple structure can be imported and
converted to a data frame directly using the xmlToDataFrame function.
By simple, we mean a collection of nodes that have the same sub-nodes
such that each node corresponds to an observation or row in the data
frame and each of its sub-nodes contains primitive values
corresponding to the variables. The data file shown above has such a
simple structure.

> library(XML)
> url <- "http://www.statistics.life.ku.dk/primer/mydata.xml"
> indata <- xmlToDataFrame(url)
> head(indata)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7


Note: Installing the XML package can be a little trickier than
installing other packages. The package uses libxml, the XML parser that
is frequently found as part of the Gnome system, but which also exists
as a stand-alone library for many systems.

See also: Use Rule 1.3 to import XML files that do not have a simple 
structure. 
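If the web address above is not reachable, the same idea can be tried
on a local file. The following sketch assumes that the XML package is
installed; the two-row file and its temporary location are made up for
the illustration. Since the values may arrive as text, the columns are
converted to numeric explicitly.

```r
library(XML)

# Write a two-row version of the example file to a temporary location
path <- tempfile(fileext = ".xml")
writeLines(c('<?xml version="1.0"?>',
             '<document>',
             '  <row><Girth>8.3</Girth><Height>70</Height><Volume>10.3</Volume></row>',
             '  <row><Girth>8.6</Girth><Height>65</Height><Volume>10.3</Volume></row>',
             '</document>'), path)

indata <- xmlToDataFrame(path)

# Convert every column from text (or factor) to numeric
indata[] <- lapply(indata, function(x) as.numeric(as.character(x)))
str(indata)
```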


1.3 Read data from an XML file 

Problem: You want to import a dataset stored in the XML file format 
by manually coding how to extract the relevant information. 

Solution: Rule 1.2 showed how to import data from an XML (eXtensible
Markup Language) file with a simple structure. Here we will import
data from a more complex structure, which xmlToDataFrame cannot
handle.

As a more complex example we will try to import the following XML
file that contains artificial data on currency exchange rates. The first
couple of nodes are document creation data, while the actual exchange
rates begin with the <rates> node. The exchange rates (measured
against the euro) and dates are coded as attributes (tags) of the <exch>
and <Date> nodes, while the bank source is a leaf node with no children.
Also note that information from both banks is not available for both
dates and that not all exchange rate nodes may be present.

<?xml version="1.0"?>
<bankdata>
  <author>Claus</author>
  <valid>Not at all</valid>
  <rates>
    <Date time="2011-03-10">
      <bank>
        <source>Some bank</source>
        <exch currency="USD" rate="1.3817"/>
        <exch currency="DKK" rate="7.4581"/>
      </bank>
      <bank>
        <source>Some other bank</source>
        <exch currency="USD" rate="1.2382"/>
        <exch currency="DKK" rate="7.3312"/>
      </bank>
    </Date>
    <Date time="2011-03-09">
      <bank>
        <source>Some bank</source>
        <exch currency="USD" rate="1.3884"/>
      </bank>
    </Date>
  </rates>
</bankdata>


Recall that an XML tree structure consists of a series of nodes branching
out from the root node, and that each of the nodes may itself have
children. Data are stored either as values or as attributes/tags of a node.

The xmlTreeParse function is the workhorse for importing general XML
documents. xmlTreeParse parses an XML file and stores the tree in an
R structure. We subsequently traverse the tree and extract data from
the relevant nodes. xmlTreeParse requires a file name or location as
input for where to find the XML file, and it returns an R XML object
with the parsed XML file. The useInternalNodes option can be set
to TRUE to increase parsing speed.

First, xmlRoot should be called to get a pointer to the top-level node or 
parent of the XML tree. The skip option can be set to FALSE to prevent 
R from skipping over document type definitions in the XML file if those 
are present. 

The XML tree structure works like a recursive list-like object and the
individual nodes in the tree are accessed using named or numbered
indices, [[ ]]. The XML tree can be traversed with the proper indices
and for each node we can get the parent and list of children sub-nodes
using the xmlParent and xmlChildren functions, respectively.
Information can be extracted from a node using one of the xmlName,
xmlValue, xmlGetAttr and xmlAttrs functions, which return the
node name, node contents, a named attribute and all attributes,
respectively.

> library(XML)
> # Location of the example XML file
> url <- "http://www.statistics.life.ku.dk/primer/bank.xml"
> # Parse the tree
> doc <- xmlTreeParse(url, useInternalNodes=TRUE)
> top <- xmlRoot(doc)         # Identify the root node
> xmlName(top)                # Show node name of root node
[1] "bankdata"
> names(top)                  # Name of root node children
  author    valid    rates
"author"  "valid"  "rates"
> xmlValue(top[[1]])          # Access first element
[1] "Claus"
> xmlValue(top[["author"]])   # First element with named index
[1] "Claus"
> names(top[[3]])             # Children of node 3
  Date   Date
"Date" "Date"
> xmlAttrs(top[[3]][[1]])     # Extract tags from a Date node
        time
"2011-03-10"
> top[["rates"]][[1]][[1]]    # Tree from first bank and date
<bank>
  <source>Some bank</source>
  <exch currency="USD" rate="1.3817"/>
  <exch currency="DKK" rate="7.4581"/>
</bank>
> xmlValue(top[[3]][[1]][[1]][[1]])  # Bank name is node value
[1] "Some bank"
> xmlAttrs(top[[3]][[1]][[1]][[1]])  # but has no tags
NULL
> xmlAttrs(top[[3]][[1]][[1]][[2]])  # The <exch> node has tags
currency     rate
   "USD" "1.3817"
> xmlValue(top[[3]][[1]][[1]][[2]])  # but no value
[1] ""


A function can be applied recursively to children of a node using the
xmlApply and xmlSApply functions, which work like apply and sapply
but operate on XML tree structures. Extracting the individual
exchange rates and combining them with the proper bank name and
date can be quite cumbersome using indices, loops and xmlApply.
Instead, we can use the XML Path Language, XPath, to query and
extract information from specific nodes in the XML tree structure.
Table 1.1 shows examples of useful XPath query strings. These can be
used with the xpathApply or xpathSApply functions, which accept
a node from where to start the search as first argument, an XPath query
string as second argument, and the function to apply as third argument.

> # Search tree for all source nodes and return their value
> xpathSApply(doc, "//source", xmlValue)
[1] "Some bank"       "Some other bank" "Some bank"
> # Search full tree for all exch nodes where currency is "DKK"
> xpathApply(doc, "//exch[@currency='DKK']", xmlAttrs)
[[1]]
currency     rate
   "DKK" "7.4581"

[[2]]
currency     rate
   "DKK" "7.3312"

Table 1.1: Examples of XPath query strings.

Expression          Description
/node               top-level node only
//node              node at any level
//node[@name]       node with an attribute named "name"
//node[@name="a"]   node with named attribute with value "a"
//node/@x           value of attribute x in node with such attribute


Below we search through the complete XML tree for each <exch> node,
which may be found at any level. From the <exch> node we extract its
attributes and get the bank name through its parent and the exchange
rate date from the time attribute of its grandparent. The do.call
function is used with rbind to combine the resulting list elements.


> res <- xpathApply(doc, "//exch",
+          function(ex) {
+            c(xmlAttrs(ex),
+              bank=xmlValue(xmlParent(ex)[["source"]]),
+              date=xmlGetAttr(xmlParent(xmlParent(ex)), "time"))
+          })
> result <- do.call(rbind, res)
> result
     currency rate     bank              date
[1,] "USD"    "1.3817" "Some bank"       "2011-03-10"
[2,] "DKK"    "7.4581" "Some bank"       "2011-03-10"
[3,] "USD"    "1.2382" "Some other bank" "2011-03-10"
[4,] "DKK"    "7.3312" "Some other bank" "2011-03-10"
[5,] "USD"    "1.3884" "Some bank"       "2011-03-09"


The variables in the resulting object can then be converted to their proper 
formats and combined in a data frame. 
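One possible way to do this conversion is sketched below. To keep the
example independent of the XML file, the character matrix is typed in by
hand with the same layout as result above; the name rates and the choice
of a factor for currency are assumptions made for the illustration.

```r
# Character matrix with the same layout as 'result' in the text
result <- rbind(
  c(currency = "USD", rate = "1.3817", bank = "Some bank",       date = "2011-03-10"),
  c(currency = "DKK", rate = "7.4581", bank = "Some bank",       date = "2011-03-10"),
  c(currency = "USD", rate = "1.2382", bank = "Some other bank", date = "2011-03-10"),
  c(currency = "DKK", rate = "7.3312", bank = "Some other bank", date = "2011-03-10"),
  c(currency = "USD", rate = "1.3884", bank = "Some bank",       date = "2011-03-09")
)

# Convert each column to a suitable type and combine in a data frame
rates <- data.frame(
  currency = factor(result[, "currency"]),
  rate     = as.numeric(result[, "rate"]),
  bank     = result[, "bank"],
  date     = as.Date(result[, "date"]),
  stringsAsFactors = FALSE
)
str(rates)
```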
1.4 Read data from an SQL database using ODBC

Problem: You want to import data from an application that supports
Open DataBase Connectivity.

Solution: Open DataBase Connectivity (ODBC) makes it possible for 
any application to access data from a SQL database regardless of which 
database management system is used to handle the data. 

The RODBC package provides an interface to databases that support
an ODBC interface, which includes most popular commercial and free
databases such as MySQL, PostgreSQL, Microsoft SQL Server, Microsoft
Access, and Oracle. Having one package with a common interface
allows the same R code to access different database systems.

The odbcConnect function opens a connection to a database and
returns an object which works as a handle for the connection. A character
string containing the data source name (DSN) should be supplied as
the first argument to odbcConnect to set the database server to
connect to. The function has two optional arguments, uid and pwd, which
set the user id and password for authentication, respectively, if that is
required by the database server and is not provided by the DSN.

The data source name is located in a separate text file or in the
registry, and it contains the information that the ODBC driver needs in
order to connect to a specific database. This includes the name,
directory, and driver of the database, and possibly the user id and
password. Each database requires a separate DSN entry, and DSN-less
connections require that all the necessary information be supplied
within R (for example by using odbcDriverConnect instead of
odbcConnect). The information for setting up the DSN should
accompany your database software.

Once a connection to a database server is established, then the available 
database tables can be seen with the sqlTables function, with the 
proper handle as first argument. The sqlFetch function fetches the 
entire table from the SQL database and returns it as an R data frame. 
sqlFetch requires two arguments, where the first is the connection 
handle and the second is a character string containing the desired table 
to extract from the database. 

The workhorse is the sqlQuery function which is used to make SQL 
queries directly to the database and return the results as R data frames. 
The first argument to sqlQuery sets the connection channel to use and 
the second argument is the selection string which should be specified 
as a regular SQL query string. Finally, the odbcClose function closes 
the connection to the channel specified by the first argument. 
In the example below, we use an existing DSN to access a database
called "myproject" which contains a "paper" table. The entire table
is extracted, as well as a selection of salespeople with sales larger
than a given number.

> library(RODBC)
> # Connect to SQL database with username and password
> channel <- odbcConnect("mydata", uid="tv", pwd="office")
> sqlTables(channel)      # List tables in the database
  TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS
1 myproject                  paper      TABLE
> mydata <- sqlFetch(channel, "paper")  # Fetch entire table
> mydata
  ID Sales         Person
1  3    10    David Brent
2  4    12  Michael Scott
3  5    13  Gareth Keenan
4  6    20 Tim Canterbury
5  7    13    Jim Halpert
6  8    23 Dwight Schrute
> sqlQuery(channel,
+          "SELECT * FROM paper WHERE Sales>12 ORDER BY Person")
  ID Sales         Person
1  8    23 Dwight Schrute
2  5    13  Gareth Keenan
3  7    13    Jim Halpert
4  6    20 Tim Canterbury
> odbcClose(channel)


See also: Rule 1.7 for an example on how to use the RODBC package to
import data from Excel under Windows. Microsoft Access databases
can be accessed directly from within Windows without creating a DSN
using the odbcConnectAccess or odbcConnectAccess2007 functions.


Reading spreadsheets 


1.5 Read data from a CSV file 

Problem: You want to import a dataset stored as a comma-separated 
values (CSV) file. 

Solution: A CSV file is a plain text file where each line corresponds
to a case and where the case variables are separated by commas.
Quotation marks are used to enclose field values that contain the
separator character.

Use read.csv to read in the delimited file just like you would use
read.table. Actually, read.csv is a wrapper function that sets the
correct options for read.table.

read.csv2 is used to read in semicolon-separated files, which is the
default CSV format in some locales where comma is the decimal point
character and where semicolon is used as separator. To read the CSV
file mydata.csv:

"Id","Sex","Age","Score"
21,"Male",14,"Little"
26,"Male",13,"None"
27,"Male",13,"Moderate, severe"
29,"Female",13,"Little"
30,"Female",15,"Little"
31,"Male",14,"Moderate, severe"


we use the command

> indata <- read.csv("mydata.csv")
> head(indata)
  Id    Sex Age            Score
1 21   Male  14           Little
2 26   Male  13             None
3 27   Male  13 Moderate, severe
4 29 Female  13           Little
5 30 Female  15           Little
6 31   Male  14 Moderate, severe


Both read.csv and read.csv2 assume that a header line is present
in the CSV file (i.e., header=TRUE is the default). If no header line is
present you need to specify the header=FALSE argument.
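The difference between the two functions can be seen in the small
sketch below, which writes one comma-separated and one
semicolon-separated file to temporary locations and reads each with the
matching function. The file contents (including the Weight column) and
the temporary paths are made up for the illustration.

```r
# Comma-separated file with a quoted field that contains a comma
csv_path <- tempfile(fileext = ".csv")
writeLines(c('"Id","Sex","Age","Score"',
             '21,"Male",14,"Little"',
             '27,"Male",13,"Moderate, severe"'), csv_path)
indata <- read.csv(csv_path)

# Semicolon-separated file with comma as the decimal point character
csv2_path <- tempfile(fileext = ".csv")
writeLines(c('"Id";"Weight"',
             '21;62,5',
             '27;58,1'), csv2_path)
indata2 <- read.csv2(csv2_path)

as.character(indata$Score)  # the embedded comma is kept in one field
indata2$Weight              # numeric, decimal comma converted
```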


1.6 Read data from an Excel spreadsheet 

Problem: You want to import a dataset stored as a Microsoft Excel file. 

Solution: The easiest way to read data from an Excel spreadsheet into 
R is to export the spreadsheet to a delimited file like a comma-separated 
file and then import the CSV file as described in Rule 1.5. 

Alternatively, the read.xlsx function from the xlsx package can be used
to read Excel worksheets (both older Excel formats and Excel 2007
and later formats) directly into R. The xlsx package depends on the
xlsxjars and rJava packages, so they need to be installed for xlsx
to work.

The first argument to read.xlsx should be the path to the Excel
spreadsheet, and the second argument, sheetIndex, takes a number
representing which sheet to import from the Excel file. By default, the
first line in the Excel sheet is assumed to be a header line, but that
can be changed by setting the header=FALSE option.

The following Excel worksheet saved as the file cunningplan.xlsx
can be imported with the following code:


[Screenshot of the Excel worksheet with the following cells:]

     A           B         C      D
1    Name        DOB       value  Rank
2    Blackadder  06/01/55  1      Cpt
3    Baldrick    15/08/46  2      Pvt
4    George      11/06/59  3      Lt
5    Melchett    24/08/57  4      Gen
6    Darling     18/09/56  5      Cpt


> library(xlsx)
Loading required package: xlsxjars
Loading required package: rJava
> goesforth <- read.xlsx("Documents/cunningplan.xlsx", 1)
> goesforth
        Name        DOB value Rank
1 Blackadder 1955-01-06     1  Cpt
2   Baldrick 1946-08-15     2  Pvt
3     George 1959-06-11     3   Lt
4   Melchett 1957-08-24     4  Gen
5    Darling 1956-09-18     5  Cpt



Note that we set the sheet to load with the second argument. The
rowIndex and colIndex options can be set to numeric vectors to
indicate which rows and columns to extract from the worksheet,
respectively. They are set to NULL by default, which means that all
rows and columns are imported.

See also: If you have Perl installed on your computer, you can use the
read.xls function from the gdata package. It automates the process
of saving the Excel sheet as a CSV file and reading it in R. See Rule 1.9
for an example of importing spreadsheet data through the clipboard.


1.7 Read data from an Excel spreadsheet under Windows 

Problem: You want to read data stored in an Excel spreadsheet on a 
machine running Windows. 

Solution: You can always use Rule 1.6 to read Excel spreadsheets
under Windows. However, under Windows the package RODBC can be
used to import data directly from an Excel spreadsheet into R.

To access the Excel file example.xls we first open a connection using
odbcConnectExcel, then extract any information from the
spreadsheet using the sqlFetch function before finally closing the ODBC
connection with odbcClose. For example, the following dataset can
be read into R by using the commands


[Screenshot of the Excel file example.xls; the worksheet Sheet1 has the
following cells:]

     A    B       C    D
1    Id   Sex     Age  Score
2    21   Male    14   Little
3    26   Male    13   None
4    27   Male    13   Moderate, severe
5    29   Female  13   Little
6    30   Female  15   Little
7    31   Male    14   Moderate, severe



> library(RODBC)
> channel <- odbcConnectExcel("example.xls")
> indata <- sqlFetch(channel, "Sheet1")
> odbcClose(channel)
> indata
  Id    Sex Age            Score
1 21   Male  14           Little
2 26   Male  13             None
3 27   Male  13 Moderate, severe
4 29 Female  13           Little
5 30 Female  15           Little
6 31   Male  14 Moderate, severe


Note that some language installations of Excel rename the sheets so
they are called, for example, "Ark1", "Ark2", ..., instead of "Sheet1",
"Sheet2", etc. If that is the case you should provide the correct name
for the sheet in the call to sqlFetch.

It should also be pointed out that odbcConnectExcel will only work 
with English-language versions of the Microsoft drivers, which may or 
may not be installed in other locales. 

See also: Excel spreadsheets can also be read directly from R using
the read.xls function from the xlsReadWrite package. The gdata
package also provides a function read.xls which works by translating
the Excel file into a temporary CSV file using a Perl script and
directly reading the CSV file. See Rule 1.9 for an example of importing
spreadsheet data through the clipboard. See Rule 1.4 for more
examples of the RODBC package.


1.8 Read data from a LibreOffice or OpenOffice Calc spreadsheet

Problem: You want to read data stored as a LibreOffice or OpenOffice 
Calc spreadsheet. 


Solution: Start Calc, save the spreadsheet as a CSV file and use Rule 1.5 
to import the spreadsheet. 

See also: Rule 1.9 shows an example of importing spreadsheet data 
through the clipboard. 
1.9 Read data from the clipboard

Problem: You have selected some data and copied them to the
clipboard and want to import the selection into R.

Solution: Sometimes it is desirable to select data from a document, a
web page, or from a spreadsheet and import the selection directly into
R. This can be done by copying the selection to the clipboard and then
importing the contents of the clipboard into R. This approach can be
used on platforms that have the equivalent of a clipboard, which is
the case for Windows, Mac OS X and machines running the X Window
System used on many Linux systems.

The contents of the clipboard can be read using the value "clipboard"
for the file option to read.table under Windows and X, and by
using pipe("pbpaste") under Mac OS X.





Figure 1.1: Selection of cells to be copied to the clipboard. 

If we select some cells in, for example, an OpenOffice spreadsheet as 
shown in Figure 1.1 and copy the selection to the clipboard then we 
import the selection with the following code: 

> mydata <- read.table(file="clipboard", header=TRUE)   # Windows/X
> mydata <- read.table(pipe("pbpaste"), header=TRUE)    # Mac OS X
> mydata
  very important data
1    1         2    3
2    5         4    3
3    2         3    2
4    1         3    5


Note that read.table reads and parses the input from the clipboard
as if it had read the data from an ASCII file (see Rule 1.1). This causes
problems with empty cells and with cells that contain text with spaces.
If the data are separated by tabs on the clipboard, which is generally
the case for spreadsheet data, we can specify the field separator to be
tabs in the call to read.table, which will handle these two situations.

> mydata <- read.table(file="clipboard", sep="\t", header=TRUE) 

On machines running the X11 Window System the "X11_clipboard"
value can be used for the file option to copy from the clipboard.

See also: The help file for the file function lists information about
clipboards. Spreadsheet data on the clipboard are often stored in the
Data Interchange Format (DIF) and the read.DIF function can be used
to read DIF formats directly. read.DIF is sometimes more robust than
read.table when there are empty cells.
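The effect of sep="\t" can be illustrated without a clipboard. In the
sketch below the clipboard contents are imitated by a character string
passed through the text argument of read.table; the string and its
column names are invented for the example. One cell is empty and one
cell contains a space.

```r
# Imitation of tab-separated clipboard contents (hypothetical data):
# "Jane Doe" has an empty score cell, and the names contain spaces
clip <- "name\tscore\tgroup\nJohn Smith\t4\tA\nJane Doe\t\tB\n"

mydata <- read.table(text = clip, sep = "\t", header = TRUE)
mydata
is.na(mydata$score)  # the empty numeric cell is read as missing
```

With the default white space separator the same input would be split
at the space inside each name, which is why the tab separator is
preferable for spreadsheet data.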


Importing data from other statistical software programs 


1.10 Import a SAS dataset 

Problem: You want to import a SAS dataset into R. 

Solution: The read.xport function from the foreign package reads 
SAS datasets stored as SAS transport (XPORT) files. 

SAS datasets are stored in different formats that depend on the
operating system and the version of SAS. To read in SAS datasets it is
necessary to save the SAS dataset as a SAS transport (XPORT) file since
that can be read on any platform. The following code can be used from
within SAS to store the SAS dataset sasdata in the XPORT format.

libname mydata xport "somefile.xpt";
/* Create a dataset in XPORT format and save it in
 * the somefile.xpt file. The file is referenced
 * internally in SAS by the name mydata.
 * Here we take an existing SAS dataset called
 * sasdata and put it into the mydata file.
 */
DATA mydata.thisdata;
  SET sasdata;
RUN;

Once the data are stored in the SAS XPORT format we can read the file
directly using read.xport:

> library(foreign) 

> indata <- read.xport("somefile.xpt") 

See also: The read.xport function from the SASxport package
extends the functionality for reading SAS XPORT files when custom
formats are present in the data.


1.11 Import an SPSS dataset 

Problem: You want to import an SPSS dataset into R. 

Solution: Datasets stored by the SPSS "save" and "export" commands
can be read by the read.spss function from the foreign package.

To read an SPSS dataset saved in the spssfilename.sav file we use
the following command in R:

> library(foreign) 

> indata <- read.spss("spssfilename.sav", to.data.frame = TRUE) 

The to.data.frame=TRUE option ensures that the SPSS data file is
stored as a data frame in R. If that option is not included, the dataset is
stored as a list.


1.12 Import a Stata dataset 

Problem: You want to import a Stata dataset into R. 

Solution: Datasets stored by the "SAVE" command in Stata can be
read in R by the read.dta function from the foreign package.

To read a Stata dataset saved as the file statafile.dta we use the
following commands in R:

> library(foreign)

> indata <- read.dta("statafile.dta") 

The read.dta function has a number of options that specify how
various data types are read. convert.dates and convert.factors are
both TRUE by default and they ensure that Stata dates are converted
to R's Date class and that Stata value labels are used to create factors,
respectively.
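
The effect of convert.factors can be checked without a Stata
installation by writing a small data frame with write.dta (also from
the foreign package) and reading it back. The following is only a
sketch: the data frame, its column names, and the temporary file are
made up for illustration.

```r
library(foreign)

# Hypothetical example data: one numeric column and one factor
df <- data.frame(weight = c(61.5, 72.0, 58.3),
                 group  = factor(c("ctrl", "trt", "ctrl")))

tmp <- tempfile(fileext = ".dta")   # temporary file, removed at the end
write.dta(df, file = tmp)

# Default: Stata value labels are used to create an R factor
with_labels <- read.dta(tmp)
class(with_labels$group)

# convert.factors = FALSE keeps the underlying numeric codes instead
codes_only <- read.dta(tmp, convert.factors = FALSE)
codes_only$group

unlink(tmp)
```

Reading the codes instead of the labels can be useful when the value
labels in the Stata file are incomplete or duplicated.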


1.13 Import a Systat dataset 

Problem: You want to import a Systat dataset into R. 

Solution: Rectangular datasets stored by the "SAVE" command in
Systat are saved as *.sys or *.syd files and they can be read in R by
the read.systat function from the foreign package.

To read a Systat dataset saved as the file systatfile.syd we use the
following commands in R:

> library(foreign) 

> indata <- read.systat("systatfile.syd") 


Exporting data 


1.14 Export data to a text file 

Problem: You want to export a data frame to an ASCII text file. 

Solution: Data stored as simple text files can be read by most programs
so it is often desirable to export a data frame as an ASCII file.
write.table writes a data frame or a matrix as an ASCII file. The
first argument should be the R object to write and the file option sets
the name of the output file as shown below.

> data(trees) 

> trees$Girth[2] <- NA # Set to missing 

> write.table(trees, file="mydata.txt") 

which produces the following file

"Girth" "Height" "Volume"
"1" 8.3 70 10.3
"2" NA 65 10.3
"3" 8.8 63 10.2
"4" 10.5 72 16.4


There are several options to write.table that change the output. The
quote option defaults to TRUE and determines if character or factor
columns are surrounded by double quotes. quote=FALSE means that
no columns are quoted, and if it is set to a numeric vector then the
values determine which column numbers are to be quoted. sep sets the
field separator string and defaults to a single white space character.
na has a default value of NA and sets the string used for missing
values in the file. The options dec and eol set the character used for
the decimal point and the character(s) used for end-of-line,
respectively.

The example below produces a file with comma as decimal point, where
strings are not quoted, where a period represents NA's and where the
end-of-line characters match the format used on machines running
Windows.

> write.table(trees, file="mydata2.txt",
+             dec=",", eol="\r\n", na=".", quote=FALSE)

which results in the following file 

Girth Height Volume
1 8,3 70 10,3
2 . 65 10,3
3 8,8 63 10,2
4 10,5 72 16,4
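
A file written with non-default options has to be read back with
matching options. The sketch below, which writes to a temporary file
and uses a semicolon separator chosen purely for illustration, shows a
round trip through write.table and read.table.

```r
# Small data frame with a missing value, as in the example above
d <- head(trees, 4)
d$Girth[2] <- NA

tmp <- tempfile(fileext = ".txt")

# Semicolon-separated, comma as decimal mark, "." for missing, no row names
write.table(d, file = tmp, sep = ";", dec = ",", na = ".",
            quote = FALSE, row.names = FALSE)

readLines(tmp)          # inspect the raw text that was written

# Reading it back requires the same sep and dec, and na.strings for "."
back <- read.table(tmp, header = TRUE, sep = ";", dec = ",",
                   na.strings = ".")
back

unlink(tmp)
```

Setting row.names=FALSE avoids the extra unnamed first column that
otherwise has to be handled when the file is imported elsewhere.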


1.15 Export a data frame to a CSV file 

Problem: You want to export a data frame to a comma-separated
values (CSV) file.

Solution: A CSV file is a plain text file where each line in the file
corresponds to an observation and where the variables are separated
from each other by a comma. Quotation marks are used to embed field
values that contain the separator character.

The write.csv function creates a delimited file, and the function works
in the same way as write.table. Actually, write.csv is a wrapper
function to write.table that sets the correct options to create the CSV
file. write.csv2 is used to create semicolon separated files, which is
the default CSV format in some countries where comma is the decimal
point character and where semicolon is used as separator.

The following code exports the trees data frame to the mydata.csv
CSV file.


> data(trees) 

> write.csv(trees, file = "mydata.csv") 


The start of mydata.csv looks like this

"","Girth","Height","Volume"
"1",8.3,70,10.3
"2",8.6,65,10.3
"3",8.8,63,10.2
"4",10.5,72,16.4
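
The empty first header field above comes from the row names. The
following sketch, using temporary files, shows how row.names=FALSE
drops that column and how write.csv2 and read.csv2 pair up for the
semicolon variant.

```r
tmp  <- tempfile(fileext = ".csv")
tmp2 <- tempfile(fileext = ".csv")

# row.names = FALSE drops the leading column of row labels
write.csv(head(trees, 3), file = tmp, row.names = FALSE)
readLines(tmp)

# write.csv2 uses ";" as separator and "," as decimal mark
write.csv2(head(trees, 3), file = tmp2, row.names = FALSE)
readLines(tmp2)

# Each file is read back with the matching reader
a <- read.csv(tmp)
b <- read.csv2(tmp2)
all.equal(a, b)

unlink(c(tmp, tmp2))
```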


1.16 Export a data frame to a spreadsheet 

Problem: You have an R data frame and want to export it to a
spreadsheet like Microsoft Excel or LibreOffice/OpenOffice Calc.

Solution: A general approach to transfer an R data frame to a
spreadsheet is to first save it as a delimited file like a
comma-separated file as described in Rule 1.15 and then import the CSV
file into the spreadsheet. This will work for virtually any spreadsheet
which can import CSV files including Microsoft Excel, OpenOffice Calc,
and LibreOffice Calc.


1.17 Export a data frame to an Excel spreadsheet under Windows

Problem: You have an R data frame and want to export it to Microsoft 
Excel on a machine running Windows. 

Solution: It is always possible to transfer an R data frame to Excel by
first saving it as a delimited file like a comma-separated file as described
in Rule 1.15 and then importing the CSV file in Excel.

However, under Windows it is possible to use the write.xls function
from the xlsReadWrite package to store an R data frame directly as
an Excel file. The following code saves the trees data frame to the
mydata.xls Excel file.

> library(xlsReadWrite) 

> data(trees) 

> write.xls(trees, "mydata.xls") 

The xlsReadWrite package requires the use of a proprietary file and
R may prompt you to install that file the first time you try to load
xlsReadWrite.


1.18 Export a data frame to a SAS dataset 

Problem: You have an R data frame and want to export it so you can 
use it with the SAS system. 

Solution: SAS datasets are stored in different formats that depend on 
the operating system and the version of SAS. In order to export data to 
a generic SAS dataset it should be saved as a SAS transport (XPORT) 
file since that can be read on any platform. 

The write.xport function from the SASxport package writes R data
frames as a SAS transport (XPORT) file. write.xport takes one or
more data frames as input and the file option sets the name of the
SAS XPORT file.

> library(SASxport) 

> data(trees) 

> write.xport(trees, file="datafromR.xpt") 

Internally, R uses the toSAS function from the SASxport package to
convert variables from each data frame to the proper format given the
constraints of the SAS XPORT format.

In SAS you can import a dataset trees located in the SAS XPORT file
datafromR.xpt using the following code.

/* Create a reference to the xport data file
 * datafromR.xpt. We reference the xport file
 * internally in SAS by the name mydata.
 * Here we read the dataset trees located in
 * the datafromR file and store it as the SAS
 * dataset indata.
 */

libname mydata xport "datafromR.xpt";

DATA indata;
  SET mydata.trees;
RUN;

See also: The write.foreign function from the foreign package can
also export data for SAS. With package="SAS" it writes the data as a
plain text file together with a SAS program that reads it.


1.19 Export a data frame to an SPSS dataset 

Problem: You have an R data frame and want to export it so you can 
use it with the SPSS system. 

Solution: The write.foreign function from the foreign package
can be used to export an R data frame to SPSS. write.foreign creates
two files: a data file with the data in a generic format and a code file
with instructions for SPSS to read the data. The package option sets
the output format and should be SPSS for SPSS datasets.

> library(foreign) 

> data(trees) 

> write.foreign(trees, datafile="mydata.dat" , 

+ codefile="mydata.sps", package="SPSS") 


In SPSS, open the syntax file mydata.sps. The syntax of the code file
will read in the data and apply variable and value labels to variables
that were factors in R.

Depending on your SPSS installation and where you have placed the
mydata.dat and mydata.sps files it may be necessary to specify the
complete path to mydata.dat in the mydata.sps file. Otherwise it
looks for the mydata.dat file in the default directory. Also note that
in some locales the comma is used as decimal point but the data stored
by R in the mydata.dat file uses period as decimal point. To set the
decimal point to period, you can execute the following command

SET DECIMAL DOT.

before running the mydata.sps script or simply add the command at
the beginning of the mydata.sps file.
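
Both files are plain text, so they can be inspected from R before they
are handed over to SPSS. The sketch below writes to temporary files
(the file names are therefore arbitrary) and prints the generated
syntax, which is where the path to the data file can be checked.

```r
library(foreign)

datafile <- tempfile(fileext = ".dat")
codefile <- tempfile(fileext = ".sps")

write.foreign(trees, datafile = datafile, codefile = codefile,
              package = "SPSS")

# The data file is plain text with one line per observation
head(readLines(datafile), 3)

# The code file holds the SPSS syntax that reads the data file
spss_syntax <- readLines(codefile)
spss_syntax

# Look for the data file path in the syntax; this is the place to edit
# if the two files are moved to another directory
grep(datafile, spss_syntax, fixed = TRUE, value = TRUE)
```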


1.20 Export a data frame to a Stata dataset 

Problem: You have an R data frame and want to export it so you can 
use it with the Stata system. 

Solution: The foreign package provides the write.dta function
which can be used to export a data frame to the Stata binary format.
The file option sets the name of the output file.

> library(foreign) 

> data(trees) 

> write.dta(trees, file="mydata.dta") 

In Stata you can read the file using the use mydata.dta command.


1.21 Export a data frame to XML 

Problem: You want to export a data frame in the XML (eXtensible
Markup Language) file format.

Solution: The XML (eXtensible Markup Language) was designed to
transport and store data and XML has seen widespread use in
interchanging data over the Internet.

The XML package provides numerous tools for parsing and generating
XML in R. In order to export a data frame to XML format we first
create an empty XML document tree using the newXMLDoc function,
add a root node with the newXMLNode function and then iterate over
each row in the data frame to add data to the tree. The row node tag is
added to the XML tree when a new row is read from the data frame, and
the variables are stored as nested individual nodes named according to
their variable name. The resulting XML tree structure can be saved to a
file using the saveXML function, where the file option sets the name
of the output file.

> library(XML) 

> data(trees) 

> mydata <- trees # Use cherry trees data as example 

> doc <- newXMLDoc() # Start empty XML document tree 

> # Start by adding a document tag at the root of the XML file 

> root <- newXMLNode("document" , doc=doc) 

> invisible( # Make output invisible 

+ lapply(1:nrow(mydata), # Iterate over all rows 

+ function(rowi) { 

+ r <- newXMLNode("row", parent=root) # Create row tag 

+ for(var in names(mydata)) { # Iterate over variables 

+ # Add each observation 

+ newXMLNode(var, mydata[rowi, var], parent = r) 

+ }})) 

> # Now save the XML document tree as file mydata.xml 

> saveXML(doc, file="mydata.xml") 

[1] "mydata.xml" 

The resulting mydata.xml looks as follows (shortened for brevity and
with a few additional line breaks inserted to make the output more
legible).

<?xml version="1.0"?> 

<document> 

<row> 

<Girth>8.3</Girth> 

<Height>70</Height> 

<Volume>10.3</Volume> 

</row> 

<row> 

<Girth>8.6</Girth> 

<Height>65</Height> 

<Volume>10.3</Volume> 


<Volume>77</Volume> 
</row> 

</document> 
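
The same row/variable layout can also be produced with base R alone,
which may help to see what the loop above builds. The helper
df_to_xml below is hypothetical and is not part of the XML package: it
simply pastes strings together and does no escaping of special
characters such as & or <, so it is only a sketch for simple numeric
or text data.

```r
# Hypothetical helper: build the XML text with sprintf/paste only.
# It does no escaping of &, < or >, so it is only safe for simple values.
df_to_xml <- function(df, root = "document", row = "row") {
  one_row <- function(i) {
    # One character value per variable for row i
    values <- vapply(df, function(col) as.character(col[i]), character(1))
    fields <- sprintf("    <%s>%s</%s>", names(df), values, names(df))
    paste(c(sprintf("  <%s>", row), fields, sprintf("  </%s>", row)),
          collapse = "\n")
  }
  rows <- vapply(seq_len(nrow(df)), one_row, character(1))
  paste(c('<?xml version="1.0"?>',
          sprintf("<%s>", root), rows, sprintf("</%s>", root)),
        collapse = "\n")
}

xml_text <- df_to_xml(head(trees, 2))
cat(xml_text, "\n")
```

For real data the XML package is the safer choice since it takes care
of escaping and encoding.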


See also: See Rule 1.2 for more information on the XML format and on 
installing the XML package. 


Chapter 2

Manipulating data 


R uses several basic data types that are frequently used in most R
functions or calculations. Here we will list the most fundamental data
type objects:

• A vector can be either numeric, complex, a character vector, or a
logical vector. These vectors are a sequence of numbers, complex
numbers, characters, or logical values, respectively.

• An array is a vector with a dimension attribute, where the
dimension attribute is a vector of non-negative integers. If the length
of the dimension vector is k, then the array is k-dimensional. The
most commonly used array is a matrix, which is a two-dimensional
array of numbers.

• A factor object is used to define categorical (nominal or ordered)
variables. They can be viewed as integer vectors where each
integer value has a corresponding label.

• A list is an ordered collection of objects. A vector can only contain
elements of one type but a list can be used to create collections of
vectors or objects of mixed type.

• A data frame is a list of vectors or factors of the same length such
that each "row" corresponds to an observation.

The following code uses the functions c, factor, data.frame, and
list to show examples of the different basic data types.

> vec1 <- 4                     # Numeric vector of length 1
> vec1
[1] 4
> vec2 <- c(1, 2, 3.4, 5.6, 7)  # Numeric vector of length 5
> vec2
[1] 1.0 2.0 3.4 5.6 7.0
> # Create a character vector
> vec3 <- c("Spare", "a talent", "for an", "old", "ex-leper")
> vec3
[1] "Spare"    "a talent" "for an"   "old"      "ex-leper"
> # And a vector of logical values
> vec4 <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
> vec4
[1]  TRUE  TRUE FALSE  TRUE FALSE
> f <- factor(vec3)             # Make factor based on vec3
> f
[1] Spare    a talent for an   old      ex-leper
Levels: a talent ex-leper for an old Spare
> x <- 1 + 2i                   # Enter a complex number
> x
[1] 1+2i
> sqrt(-1)                      # R doesn't see a complex number
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
> sqrt(-1 + 0i)                 # but it does here
[1] 0+1i
> m <- matrix(1:6, ncol=2)      # Create a matrix
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> list(vec2, comp=x, m)         # Combine 3 different objects
[[1]]
[1] 1.0 2.0 3.4 5.6 7.0

$comp
[1] 1+2i

[[3]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

> data.frame(vec2, f, vec4)     # Make a data frame
  vec2        f  vec4
1  1.0    Spare  TRUE
2  2.0 a talent  TRUE
3  3.4   for an FALSE
4  5.6      old  TRUE
5  7.0 ex-leper FALSE
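
The descriptions of the data types can be checked directly with a few
inspection functions. The short sketch below uses made-up values and
shows that c coerces mixed input to one type, that a matrix is a
vector with a dim attribute, that a factor stores integer codes, and
that a data frame is a list.

```r
v <- c(1, 2, 3.4)
class(v)                        # "numeric"

# Mixing types in c() coerces everything to one common type
class(c(1, "a", TRUE))          # "character"

# An array is a vector plus a dim attribute
a <- 1:6
dim(a) <- c(3, 2)
is.matrix(a)                    # TRUE

# A factor stores integer codes together with labels (levels)
f <- factor(c("low", "high", "low"))
codes <- as.integer(f)          # codes refer to the sorted levels
levels(f)

# A data frame is a list whose elements are columns of equal length
df <- data.frame(v, f)
is.list(df)                     # TRUE
sapply(df, class)
```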


Table 2.1: Mathematical operators and functions

Symbol / function           Description
+                           addition
-                           subtraction
*                           multiplication
/                           division
^ or **                     exponentiation
%%                          modulus
%/%                         integer division
abs(x)                      absolute value
sqrt(x)                     square root
ceiling(x)                  smallest integer not less than x
floor(x)                    largest integer not greater than x
trunc(x)                    truncate x by discarding decimals
round(x, digits=0)          round x to digits decimals
signif(x, digits=6)         round to digits significant digits
cos(x), sin(x) and tan(x)   cosine, sine, and tangent
log(x)                      natural logarithm
log(x, base=2)              logarithm with base 2
log10(x)                    common logarithm
exp(x)                      exponential function
%*%                         matrix multiplication


2.1 Using mathematical functions and operations 

Problem: You want to apply basic mathematical functions or use
arithmetic operators in your calculations.

Solution: R contains a large number of arithmetic operators and
mathematical functions and some of the most common are listed in
Table 2.1. The code below shows examples where these functions are
used.


> 5/2 + 2*(5.1 - 2.3)     # Add, subtract, multiply and divide
[1] 8.1
> 2**8                    # 2 to the power 8
[1] 256
> 1.61^5                  # 1.61 to the power 5
[1] 10.81756
> 10 %% 3                 # 10 modulus 3 has remainder 1
[1] 1
> 10 %/% 3                # integer division
[1] 3
> abs(-3.1)               # absolute value of -3.1
[1] 3.1
> sqrt(5)                 # square root of 5
[1] 2.236068
> ceiling(4.3)            # smallest integer larger than 4.3
[1] 5
> floor(4.3)              # largest integer smaller than 4.3
[1] 4
> trunc(4.3)              # remove decimals
[1] 4
> round(4.5)
[1] 4
> round(4.51)
[1] 5
> round(4.51, digits=1)
[1] 4.5
> # Angles for trigonometric functions use radians, not degrees
> cos(pi/2)               # cosine of pi/2
[1] 6.123234e-17
> sin(pi/4)               # sine of pi/4
[1] 0.7071068
> tan(pi/6)               # tangent of pi/6
[1] 0.5773503
> log(5)                  # natural logarithm of 5
[1] 1.609438
> log(5, base=2)          # binary logarithm of 5
[1] 2.321928
> log10(5)                # common logarithm of 5
[1] 0.69897
> exp(log(5) + log(3))    # exponential function
[1] 15
> # Create two matrices
> x <- matrix(1:6, ncol=3)
> y <- matrix(c(1, 1, 0, 0, 0, 1), ncol=2)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> y
     [,1] [,2]
[1,]    1    0
[2,]    1    0
[3,]    0    1
> x %*% y                 # Matrix multiplication
     [,1] [,2]
[1,]    4    5
[2,]    6    6
> y %*% x                 # Matrix multiplication
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    1    3    5
[3,]    2    4    6
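
Three of the results above tend to surprise newcomers: round(4.5)
gives 4, the modulus of a negative number is not negative, and
cos(pi/2) is not exactly zero. The following sketch, with values
chosen only for illustration, shows each behaviour in isolation.

```r
# Exact halves are rounded to the even neighbour, so round(4.5) is 4
halves <- round(c(0.5, 1.5, 2.5, 3.5))
halves

# Integer division and modulus fit together: x == (x %/% y) * y + x %% y
x <- c(10, -10)
q <- x %/% 3          # %/% rounds towards minus infinity
r <- x %% 3           # the remainder takes the sign of the divisor
q * 3 + r

# Floating point results such as cos(pi/2) are only approximately zero;
# all.equal compares with a tolerance instead of testing exact equality
exact  <- cos(pi/2) == 0
approx <- isTRUE(all.equal(cos(pi/2), 0))
c(exact = exact, approx = approx)
```

Comparisons of computed numeric values are therefore usually safer
with all.equal than with ==.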


2.2 Working with common functions

Problem: You need to identify and/or use some of the common
functions.

Solution: The large number of functions in R means that users
sometimes have difficulties in identifying a function that provides a
certain functionality. The search commands described in Rule 5.1 may
help to identify some functions, but even that approach may have
difficulties for many of the common/standard functions. Table 2.1 from
Rule 2.1 lists some of the mathematical functions and operators and
Table 2.2 lists some other frequently used functions found in R.

Table 2.2: Frequently used functions

Symbol / function        Description
mean(x)                  mean
median(x)                median
sd(x)                    standard deviation
var(x)                   variance
IQR(x)                   inter-quartile range
cor(x,y)                 correlation between x and y
length(x)                length of vector x
sum(x)                   sum
cumsum(x)                cumulative or aggregated sum
sort(x)                  sort
order(x)                 order
min(x) and max(x)        minimum and maximum value
quantile(x)              quantile
is.na(x)                 test for missing observations
nrow(df) and ncol(df)    rows and columns in data frame df


> x <- c(6, 8, 1:4)
> x
[1] 6 8 1 2 3 4
> mean(x)                 # mean of x
[1] 4
> median(x)               # median
[1] 3.5
> sd(x)                   # standard deviation
[1] 2.607681
> IQR(x)                  # inter-quartile range
[1] 3.25
> y <- (1:6)**2           # New vector
> y
[1]  1  4  9 16 25 36
> cor(x,y)                # Pearson correlation
[1] -0.3783846
> cor(x, y, method="spearman") # Spearman correlation
[1] -0.3714286
> cor(x, y, method="kendall")  # Kendall correlation
[1] -0.06666667
> length(x)               # length of x
[1] 6
> sum(x)                  # sum of elements in x
[1] 24
> cumsum(x)               # cumulative sum of all elements of x
[1]  6 14 15 17 20 24
> sort(x)                 # sort the vector x
[1] 1 2 3 4 6 8
> order(x)                # order the elements of x in ascending order
[1] 3 4 5 6 1 2
> min(x)                  # minimum value
[1] 1
> max(x)                  # maximum value
[1] 8
> quantile(x)             # get quartiles
  0%  25%  50%  75% 100%
1.00 2.25 3.50 5.50 8.00
> quantile(x, probs=c(.15, .25, .99)) # specific quantiles
 15%  25%  99%
1.75 2.25 7.90
> x[c(2,4)] <- NA         # redefine x
> x
[1]  6 NA  1 NA  3  4
> is.na(x)                # identify missing elements
[1] FALSE  TRUE FALSE  TRUE FALSE FALSE
> df <- data.frame(x=c(1, 2, 3, 4), y=c(2, 1, 3, 2))
> df
  x y
1 1 2
2 2 1
3 3 3
4 4 2
> nrow(df)                # number of rows in matrix/data frame
[1] 4
> ncol(df)                # number of columns in matrix/data frame
[1] 2

Several of the functions used above take optional arguments. In
particular the na.rm option can be set to TRUE to ensure that missing
observations are disregarded in the computations as shown in the
following. Likewise, setting the decreasing=TRUE option for the sort
and order functions reverses the order.

> x                       # vector with missing values
[1]  6 NA  1 NA  3  4
> sum(x)                  # sum returns missing
[1] NA
> sum(x, na.rm=TRUE)      # unless we remove the NA's
[1] 14
> mean(x)                 # mean also returns missing
[1] NA
> mean(x, na.rm=TRUE)     # unless we remove the NA's
[1] 3.5
> order(x)                # order places NA's last
[1] 3 5 6 1 2 4
> sort(x)                 # sort removes NA altogether
[1] 1 3 4 6
> sort(x, na.last=TRUE)   # unless we want to keep them
[1]  1  3  4  6 NA NA
> sort(x, na.last=TRUE, decreasing=TRUE)
[1]  6  4  3  1 NA NA
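
The same options apply to other summary functions, and order can be
combined with indexing to sort a vector while keeping track of the
original positions. A small sketch using the vector from above:

```r
x <- c(6, NA, 1, NA, 3, 4)

# Most summary functions accept na.rm in the same way as sum and mean
largest <- max(x, na.rm = TRUE)
middle  <- median(x, na.rm = TRUE)
c(largest, middle)

# order returns positions; indexing with them gives the sorted vector,
# with NAs placed last by default
sorted <- x[order(x)]
sorted

# decreasing = TRUE reverses the non-missing values, NAs still come last
positions <- order(x, decreasing = TRUE)
positions

# is.na gives a quick count of the missing values
n_missing <- sum(is.na(x))
n_missing
```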


2.3 Working with dates 

Problem: You have one or more character vectors that contain charac¬ 
ter strings representing dates and you want R to handle them as dates 
and not as character vectors. 

Solution: Internally in R, dates are stored as Date objects and are
represented as the number of days since January 1, 1970, with negative
values for earlier dates.

The format of date strings may vary greatly, for example in the ordering
of month, day, and year, in how months are represented (as numbers
(1-12), abbreviated strings ("Feb"), or unabbreviated strings
("February")), whether or not leading zeroes are used for numbers less
than 10, and the number of digits used to denote years.

The function as.Date converts a character string to R's Date class.
The actual format of the date is set with the format option where the
symbols shown in Table 2.3 can be combined to specify the actual input
format. Spaces, slashes, and hyphens should be placed between the
month, day, and year symbols as they appear in the input string. If the
format option is not specified, R will first try the format "%Y-%m-%d"
and subsequently "%Y/%m/%d" to see if either of these input formats
results in a proper conversion for the first non-missing element. An
error is given if the conversion is not possible. The Sys.Date function
returns the current date.

Table 2.3: Some of the symbols available for conversion of character strings to
dates

Symbol   Interpretation          Example
%d       day as a number         21
%a       abbreviated weekday     Fri
%A       unabbreviated weekday   Thursday
%m       month as a number       09
%b       abbreviated month       Jan
%B       unabbreviated month     February
%y       2-digit year            10
%Y       4-digit year            2011


> today <- Sys.Date() # Current date

> today

[1] "2010-08-20"

> date1 <- as.Date("1971-08-20")

> date2 <- as.Date("2011-04-01")

> date2 - date1 # Difference between dates

Time difference of 14469 days

> # Read dates in the format m/d/y

> date3 <- c("02/28/00", "12/20/01") 

> as.Date(date3) 

Error in charToDate(x) : 

character string is not in a standard unambiguous format 

> as.Date(date3, "%m/%d/%y") 

[1] "2000-02-28" "2001-12-20" 


By default, R prints dates as "(4-digit year)-(2-digit month)-(2-digit 
day)" once a character string has been converted to a date, but that 
can be changed with the format function. The format function can 
output a date in any desired format by supplying the format function 
with an output format given by the symbols in Table 2.3. 

> format(today, "%d %B %Y") # Print current date w text for month 
[1] "20 August 2010" 

> format(today, "%a %d") 

[1] "Fri 20" 

See also: The help page for strptime lists all possible conversion
symbols. See the DateTimeClasses help page if you need to work with
both times and dates.
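
The internal day count and the handling of two-digit years are easy to
verify. The sketch below uses arbitrary example dates and only numeric
format symbols, since the names produced by %a, %A, %b and %B depend
on the locale.

```r
# A Date is stored as the number of days since 1970-01-01
after_origin  <- as.numeric(as.Date("1970-01-10"))
before_origin <- as.numeric(as.Date("1969-12-31"))
c(after_origin, before_origin)

# Adding a number to a Date moves it by that many days
d <- as.Date("2011-02-27") + 0:2
d

# A date difference can be turned into a plain number of days
n_days <- as.numeric(as.Date("2011-04-01") - as.Date("1971-08-20"))
n_days

# Numeric format symbols avoid locale-dependent month and weekday names
formatted <- format(d, "%d/%m/%Y")
formatted

# With %y, two-digit years 00-68 are read as 20xx and 69-99 as 19xx
two_digit <- as.Date(c("12/20/68", "12/20/69"), format = "%m/%d/%y")
two_digit
```

Because of the %y rule it is generally safer to store and exchange
dates with four-digit years.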


2.4 Working with character vectors

Problem: You have a character vector and want to modify or get
information from the character strings.

Solution: R has numerous functions that work on character vectors.
Here we will give an example of some of the functions that operate on
character strings.

The toupper and tolower functions translate each element of the
input character vector to uppercase or lowercase, respectively.
nchar takes a character vector as input and returns a vector with the
number of characters of each input element.

grep is a text search function. The first argument to grep is a
regular expression containing the search pattern. The second argument
is a character vector where matches are sought. grep returns a vector
of indices where the regular expression pattern is found. The
ignore.case option can be set to TRUE to make the search pattern case
insensitive.

The substr function can be used to extract or replace substrings in a
character vector. substr takes a character vector as first argument, and
has two other required arguments: start and stop which should both
be integer values and which define the starting and ending position of
the substring extracted (the first character is position 1). If the selected
substring is replaced, then the replacement string is inserted from the
start position until the stop position.

paste accepts one or more R objects which are converted to character
vectors and then concatenated together. The character strings are
separated by the string given to the sep option which defaults to a
single space, sep=" ". If the collapse option is specified, then the
elements in the resulting vector are collapsed together to a single
string where the individual elements are separated by the character
string given by collapse.


> x <- c("Suki", "walk", "to", "the", "well", NA)
> toupper(x)              # convert to upper case
[1] "SUKI" "WALK" "TO"   "THE"  "WELL" NA
> tolower(x)              # convert to lower case
[1] "suki" "walk" "to"   "the"  "well" NA
> nchar(x)                # no. of characters in each element of x
[1] 4 4 2 3 4 2
> grep("a", x)            # elements that contain letter "a"
[1] 2
> grep("o", x)            # elements that contain letter "o"
[1] 3
> substr(x, 2, 3)         # extract 2nd to 3rd characters
[1] "uk" "al" "o"  "he" "el" NA
> substr(x, 2, 3) <- "X"  # replace 2nd char with X
> x
[1] "SXki" "wXlk" "tX"   "tXe"  "wXll" NA
> substr(x, 2, 3) <- "XXXX"
> x
[1] "SXXi" "wXXk" "tX"   "tXX"  "wXXl" NA
> paste(x, "y")           # paste "y" to end of each element
[1] "SXXi y" "wXXk y" "tX y"   "tXX y"  "wXXl y" "NA y"
> paste(x, "y", sep="-")
[1] "SXXi-y" "wXXk-y" "tX-y"   "tXX-y"  "wXXl-y" "NA-y"
> paste(x, "y", sep="-", collapse=" ")
[1] "SXXi-y wXXk-y tX-y tXX-y wXXl-y NA-y"
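
The ignore.case option and the difference between sep and collapse
are perhaps easiest to see side by side. The sketch below also uses
the value option of grep and the related grepl function, both of which
are part of base R.

```r
x <- c("Suki", "walk", "to", "the", "well")

# ignore.case = TRUE makes "s" match the capital S in "Suki"
case_sensitive   <- grep("s", x)    # no match when case matters
case_insensitive <- grep("s", x, ignore.case = TRUE)
case_insensitive

# value = TRUE returns the matching strings instead of their positions
with_l <- grep("l", x, value = TRUE)
with_l

# grepl returns one logical value per element
has_w <- grepl("w", x)
has_w

# collapse turns the whole vector into one string
sentence <- paste(x, collapse = " ")
sentence

# sep joins the arguments element by element, collapse joins the result
counts <- paste(x, nchar(x), sep = ":", collapse = ", ")
counts
```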


2.5 Find the value of x corresponding to the maximum 
or minimum of y 

Problem: You have two numeric vectors, x and y, of the same length 
and want to find the value of x where y attains its maximum value. 

Solution: The which.max function returns the index of the (first)
maximum of a vector. which.max can be used directly to return the
value of a vector x corresponding to the location of the maximum of a
vector y of the same length.

In the code below we wish to find the location of the highest peak of a 
function evaluated at different positions. 

> x <- seq(0, pi/2, .2) 

> x 

[1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 

> y <- sin(2*x) 

> y 

[1] 0.0000000 0.3894183 0.7173561 0.9320391 0.9995736 
[6] 0.9092974 0.6754632 0.3349882 

> x[which.max(y)]

[1] 0.8

See also: The which.min function returns the index of the first
minimum of a vector.
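
Since which.max only reports the first maximum, ties need a different
approach. A minimal sketch with made-up vectors where the maximum of y
occurs twice:

```r
x <- c(10, 20, 30, 40, 50)
y <- c(3, 7, 2, 7, 1)

first_pos <- which.max(y)          # position of the first maximum only
x[first_pos]

# All positions where the maximum is attained
all_max <- x[which(y == max(y))]
all_max

# The same idea for the minimum
at_min <- x[which.min(y)]
at_min
```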


2.6 Check if elements in one object are present in another object

Problem: You want to check which of the elements in a vector,
matrix, list, or data frame are present in another vector, matrix, list
or data frame.

Solution: The %in% operator checks if each of the elements in the
object on the left-hand side has a match among the elements in the
object on the right-hand side. The operator converts factors to
character vectors before matching, and returns a logical vector that
states the match for each element.

> x <- c(1, 2, 3, 1, NA, 4)

> y <- c(1, 2, 5, NA)

> x %in% y 

[1] TRUE TRUE FALSE TRUE TRUE FALSE 

> y %in% x 

[1] TRUE TRUE FALSE TRUE 

> ! x %in% y # Invert result: find objects NOT present 
[1] FALSE FALSE TRUE FALSE FALSE TRUE 

The %in% operator also works for matrices, data frames and lists.
Matrices are converted to vectors before matching. Lists and data
frames are matched based on their individual elements.

> z <- matrix(1:6, ncol=2)
> z
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> x %in% z
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
> list(1, 2, 3) %in% list(a=c(1,2), b=3)
[1] FALSE FALSE  TRUE
> data.frame(a=c(1,2), b=c(3,4)) %in% data.frame(a=c(2,1),
+                                                d=c(1,2))
[1]  TRUE FALSE
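
The logical vector returned by %in% is typically used for subsetting,
and the related match function gives the positions of the matches. A
short sketch with the vectors x and y used earlier in this rule:

```r
x <- c(1, 2, 3, 1, NA, 4)
y <- c(1, 2, 5, NA)

# Keep only the elements of x that are (or are not) found in y
found     <- x[x %in% y]
not_found <- x[!(x %in% y)]
not_found

# match returns the position in y of the first match, NA if there is none
pos <- match(x, y)
pos

# %in% is equivalent to asking whether match found anything
same <- identical(x %in% y, match(x, y, nomatch = 0) > 0)
same

# Unlike ==, %in% never returns NA, so it is safe for subsetting
x == 1
x %in% 1
```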


2.7 Transpose a matrix (or data frame) 

Problem: You want to transpose a matrix to swap columns and rows 
around. 

Solution: The transpose function, t, transposes a matrix by
interchanging the rows and columns of the matrix. The result is
returned as a matrix.

> m <- matrix(1:6, ncol=2) # Create matrix
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> t(m) # Transpose
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

The transpose function also works on data frames but it is generally a
bad idea to transpose data frames because R converts the data frame
to a matrix before transposing. As a consequence the result will be a
character matrix if there are any non-numeric columns in the original
data frame.

The following example shows how the transpose function swaps rows 
and columns around on an artificial data frame. 

> mpdat <- data.frame(source=c("KU", "Spain", "Secret", "Attic"),
+                     rats=c(97, 119, 210, 358))
> mpdat
  source rats
1     KU   97
2  Spain  119
3 Secret  210
4  Attic  358
> t(mpdat)
       [,1]  [,2]    [,3]     [,4]
source "KU"  "Spain" "Secret" "Attic"
rats   " 97" "119"   "210"    "358"

The data frame has now been switched around but note that the values 
are enclosed in quotes as if they were text. 

The rows of a matrix cannot have different types in R so it is necessary 
to remove character and factor columns from the original data frame 
before transposition if the numeric values of the transposed data frame 
are to be used. 

> t(mpdat[,2])
     [,1] [,2] [,3] [,4]
[1,]   97  119  210  358
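
When there are several numeric columns it is more convenient to select
them by type than by position. The sketch below adds a made-up second
numeric column (mice) to the example data and keeps the source names
as column names of the transposed matrix.

```r
mpdat <- data.frame(source = c("KU", "Spain", "Secret", "Attic"),
                    rats   = c(97, 119, 210, 358),
                    mice   = c(12, 30, 7, 41))    # extra made-up column

# Select the numeric columns by type instead of by position
num_cols <- sapply(mpdat, is.numeric)
tm <- t(mpdat[num_cols])

# Use the source column as column names of the transposed matrix
colnames(tm) <- as.character(mpdat$source)
tm

is.numeric(tm)      # the values are still numbers, not strings
```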


2.8 Impute values using last observation carried forward

Problem: You have a vector of observations from a longitudinal study
and want to use the last observation carried forward method for each
individual to impute data for missing values.

Solution: A common occurrence in longitudinal studies is dropouts
where an individual drops out of the study before all responses can
be obtained. Measurements obtained before the individual dropped
out can be used to impute the unknown measurement(s) and the last
observation carried forward method is one way to impute values for
the missing observations. For the last observation carried forward
approach the missing values are replaced by the last observed value of
that variable for each individual regardless of when it occurred.

The zoo package provides the na.locf function which replaces NA's
with the most recent non-NA prior to it. The input to na.locf can be
either a vector or a data frame, and if it is a data frame then the function
will be applied to each of the variables in the data frame.

The option fromLast can be set to TRUE to make na.locf carry
observations backward instead of forward. The maxgap option sets the
maximum number of consecutive NA's to fill (defaults to infinite). The
na.rm option defaults to TRUE and causes leading NA's to be removed.
It can be set to FALSE to keep leading NA's in the data.

Below we create a data frame where three individuals are measured at 
four time points. We wish to use the last observation carried forward 
approach for the variable x for each individual. In order to do that, 
we first sort the data frame according to individual and then time (to 
ensure that observations are indeed carried forward in time) and we 
use the by function to carry observations forward for each individual. 

> # Start by creating a data frame with missing observations
> id <- factor(rep(c("A", "B", "C"), 4))
> time <- rep(1:4, times=3)
> x <- c(1, 2, 3, 4, 5, NA, NA, NA, NA, NA, NA, 6)
> mydata <- data.frame(id, time, x)
> mydata                            # Data frame
   id time  x
1   A    1  1
2   B    2  2
3   C    3  3
4   A    4  4
5   B    1  5
6   C    2 NA
7   A    3 NA
8   B    4 NA
9   C    1 NA
10  A    2 NA
11  B    3 NA
12  C    4  6

> o <- order(mydata$id, mydata$time)
> newdata <- mydata[o,]             # Ordered data frame
> newdata
   id time  x
1   A    1  1
10  A    2 NA
7   A    3 NA
4   A    4  4
5   B    1  5
2   B    2  2
11  B    3 NA
8   B    4 NA
9   C    1 NA
6   C    2 NA
3   C    3  3
12  C    4  6

> library(zoo)
> # Just using na.locf will carry observations forward
> # across individuals as the following line shows
> na.locf(newdata$x)                # Not working correctly
 [1] 1 1 1 4 5 2 2 2 2 2 3 6
> # Create function that uses na.locf on the correct variable: x
> result <- by(newdata, newdata$id,
+              function(df) { na.locf(df$x, na.rm=FALSE) })
> result
newdata$id: A
[1] 1 1 1 4


newdata$id: B
[1] 5 2 2 2


newdata$id: C
[1] NA NA  3  6

> newdata$newx <- unlist(result)    # unlist and add to data frame
> newdata
   id time  x newx
1   A    1  1    1
10  A    2 NA    1
7   A    3 NA    1
4   A    4  4    4
5   B    1  5    5
2   B    2  2    2
11  B    3 NA    2
8   B    4 NA    2
9   C    1 NA   NA
6   C    2 NA   NA
3   C    3  3    3
12  C    4  6    6

The last observation carried forward approach should be used with
care, since it may result in biased estimates and may underestimate
the variability.
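
If the zoo package is not available, the same per-individual carrying
forward can be sketched in base R. The helper name locf below is our
own (hypothetical) and is not part of zoo. The ave function applies it
within each level of id and returns the values in the original row
order, so no unlist step is needed.

```r
# Hypothetical base-R helper (not part of zoo): carry the last observed
# value forward within a single vector, keeping leading NAs
locf <- function(v) {
  idx <- cumsum(!is.na(v))        # number of observed values seen so far
  c(NA, v[!is.na(v)])[idx + 1]    # idx == 0 (leading NA) picks the NA
}

id   <- factor(rep(c("A", "B", "C"), 4))
time <- rep(1:4, times=3)
x    <- c(1, 2, 3, 4, 5, NA, NA, NA, NA, NA, NA, 6)
mydata  <- data.frame(id, time, x)
newdata <- mydata[order(mydata$id, mydata$time), ]

# ave() applies locf separately for each individual
newdata$newx <- ave(newdata$x, newdata$id, FUN=locf)
newdata
```

The data frame must still be sorted by individual and time first, as
in the zoo-based solution.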


2.9 Convert comma as decimal mark to period 

Problem: You have a vector of observations where comma is used as 
the decimal mark. R sees the values of the vector as character strings 
and you want to convert it to numeric values. 


Solution: R uses the period as the decimal mark but in many countries
the comma is used as the decimal mark. When data are imported into
R the decimal mark can often be specified, for example with the dec
option to read.table or read.csv.
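
As a small illustration of the dec option, the sketch below reads
made-up semicolon-separated data from a character string (through the
text argument) rather than from a file. The variable names id and
weight are our own choices for the example.

```r
# Semicolon-separated text with comma as decimal mark (made-up data)
txt <- "id;weight
1;1,2
2;1,5
3;1,8"

# dec="," tells read.table how to interpret the decimal mark on import
indata <- read.table(text=txt, header=TRUE, sep=";", dec=",")
str(indata)
indata$weight * 2    # weight is numeric, so arithmetic works
```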

If you have a vector of observations that are numbers using comma as
the decimal mark, then R will view the vector either as a factor or as a
vector of character strings. In order to convert the values to numeric,
we use the sub function to substitute the period decimal point for the
comma and then call as.numeric to convert the resulting vector to
numeric values. sub takes three arguments: the string pattern that
should be matched, the character string that should be substituted for
it, and the character vector on which it should be applied.


> x <- c("1,2", "1,5", "1,8", "1.9", "2")  # Example data
> as.numeric(sub(",", ".", x))   # Substitute , with . and convert
[1] 1.2 1.5 1.8 1.9 2.0
> f <- factor(x)                 # It also works on factors
> as.numeric(sub(",", ".", f))   # Substitute , with . and convert
[1] 1.2 1.5 1.8 1.9 2.0
> as.numeric(x)         # Does not work - need to substitute first
[1]  NA  NA  NA 1.9 2.0
Warning message:
NAs introduced by coercion


Since sub returns a character vector and not a factor, we can use the
function as.numeric directly on the result from sub, unlike the
situation covered in Rule 2.20.


2.10 Select a subset of a dataset

Problem: You want to select a subset of cases of a data frame based on 
some criteria. 

Solution: Specific cases of a data frame can be selected using the
built-in index functions in R. Alternatively, the subset function can
be used to define the criteria to select cases from a data frame. The
first argument to subset is the name of the original data frame and the
subset argument is a logical statement that identifies the desired
cases. The logical statement can contain variables from the original
data frame. For example, if we wish to select the cases from the
ToothGrowth data where the dose is less than 2 we do as follows:


> data(ToothGrowth)
> subset(ToothGrowth, dose < 2)
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
[ More data lines here ]
47 25.8   OJ  1.0
48 21.2   OJ  1.0
49 14.5   OJ  1.0
50 27.3   OJ  1.0

Several logical statements may be combined to achieve a more complex 
subsetting. If we want to select observations with dose less than 2 and 
where the supplement type is VC we use the command: 


> subset(ToothGrowth, dose < 2 & supp == "VC")
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
[ More data lines here ]
19 18.8   VC  1.0
20 15.5   VC  1.0


The select argument to the subset function enables you to extract 
specific variables from the original data frame. To select only the len 
and supp variables from the ToothGrowth data frame when we are 
only interested in observations with a dose less than 2 and where the 
supplement type is VC we type: 

> subset(ToothGrowth, dose < 2 & supp == "VC",
+        select=c("len", "supp"))
    len supp
1   4.2   VC
2  11.5   VC
3   7.3   VC
4   5.8   VC
5   6.4   VC
[ More data lines here ]
19 18.8   VC
20 15.5   VC


Note that when a factor is subsetted, it will not automatically drop
levels that are no longer present. That can result in tables with empty
and irrelevant cells and figures with empty bars, etc. Use Rule 2.23 to
remove unused factor levels.
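
The index-based alternative mentioned at the start of the solution can
be sketched as follows: a logical vector selects the rows and a
character vector selects the columns. The object names rows, idx and
sub are our own. Note that the variables must be prefixed with the
name of the data frame when indexing directly.

```r
data(ToothGrowth)

# Logical index for the rows, character vector for the columns
rows <- ToothGrowth$dose < 2 & ToothGrowth$supp == "VC"
idx  <- ToothGrowth[rows, c("len", "supp")]

# The same selection with subset, for comparison
sub <- subset(ToothGrowth, dose < 2 & supp == "VC",
              select=c("len", "supp"))

nrow(idx)
all.equal(idx, sub)
```

One difference to keep in mind is that subset discards cases where the
logical statement evaluates to NA, whereas direct indexing returns a
row of NAs for each such case. There are no missing values in
ToothGrowth, so the two results agree here.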


2.11 Select the complete cases of a dataset 

Problem: You want to select a subset of cases of a data frame for which 
there are no missing values. 

Solution: The complete.cases function returns a logical vector
indicating which cases are complete. The function takes a sequence of
vectors, matrices and data frames as arguments.

The New York daily air quality measurements in the airquality
dataset contain missing values for the ozone and solar radiation
variables.


> data(airquality)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> complete <- complete.cases(airquality)
> complete
  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
[ More data lines here ]
[145]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
> ccdata <- airquality[complete, ]
> head(ccdata)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
7    23     299  8.6   65     5   7
8    19      99 13.8   59     5   8


If we know that we will only use some of the variables from a data
frame, then the complete cases selection should be based on those
variables only. For example, if we are only interested in the ozone and
temperature values, then we would specify these variables as arguments
to the complete.cases function instead of the whole data frame.

> complete2 <- complete.cases(airquality$Ozone, airquality$Temp)
> oztemp <- airquality[complete2,]
> head(oztemp)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
6    28      NA 14.9   66     5   6
7    23     299  8.6   65     5   7


The complete.cases function is particularly useful with respect to
model reduction when two statistical models are compared since the fit
of the two models should be based on the same data.
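
As a brief sketch of two related idioms: na.omit removes the incomplete
cases from a data frame in one step, and complete.cases can be given a
data frame restricted to the variables of interest instead of the
individual vectors. The object name cc2 is our own.

```r
data(airquality)

# Complete cases over all variables; na.omit() removes the same rows
sum(complete.cases(airquality))
nrow(na.omit(airquality))

# Complete cases based on two variables only, selected by name
cc2 <- complete.cases(airquality[, c("Ozone", "Temp")])
sum(cc2)
```

Basing the selection on fewer variables can never reduce the number of
complete cases, so sum(cc2) is at least as large as the count for the
full data frame.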


2.12 Delete a variable from a data frame 

Problem: You want to delete a variable from a data frame. 

Solution: There exist several functions in various R packages that
allow you to rename and delete variables in a data frame. However,
you can remove a variable from a data frame directly, simply by setting
it to NULL.

> mydat <- data.frame(x=1:5, y=factor(c("A", "B", "A", "B", NA)))
> mydat
  x    y
1 1    A
2 2    B
3 3    A
4 4    B
5 5 <NA>
> mydat$y <- NULL     # Remove the y variable from mydat
> mydat
  x
1 1
2 2
3 3
4 4
5 5
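
When several variables should be removed at once, it can be convenient
to drop them by name in a single step. The sketch below uses a small
made-up data frame; the object names drop and kept are our own.

```r
mydat <- data.frame(x=1:5, y=letters[1:5], z=5:1)

# Drop several variables by name in one step
drop <- c("y", "z")
kept <- mydat[, !(names(mydat) %in% drop), drop=FALSE]
names(kept)

# The same with subset and a negative selection
names(subset(mydat, select=-c(y, z)))
```

The drop=FALSE argument ensures that the result is still a data frame
even when only a single variable remains.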


2.13 Join datasets 

Problem: You want to join two data frames that contain the same
variables but consist of different cases.

Solution: The rbind function merges two data frames "vertically" 
by combining the rows from the two datasets. The two data frames 
must have the same variables (and should be of the same types), but 
the variables do not have to be in the same order. 

> df1 <- data.frame(id=1:3, x=11:13, y=factor(c("A", "B", "C")))
> df1
  id  x y
1  1 11 A
2  2 12 B
3  3 13 C
> df2 <- data.frame(id=6:10, y=factor(1:5), x=1:5)
> df2
  id y x
1  6 1 1
2  7 2 2
3  8 3 3
4  9 4 4
5 10 5 5
> combined <- rbind(df1, df2)
> combined
  id  x y
1  1 11 A
2  2 12 B
3  3 13 C
4  6  1 1
5  7  2 2
6  8  3 3
7  9  4 4
8 10  5 5

If one of the data frames contains variables that the other does not, then 
there are two possible options before joining them: Either delete the 
extra variables that only exist in one of the two data frames (see Rule 
2.12), or create the missing variable(s) and set them to NA (missing) for 
the data frame where they are not present. The two possible solutions 
are shown below. 

> df1 <- data.frame(id=1:3, x=11:13, y=factor(c("A", "B", "C")),
+                   z=1:3)
> df2 <- data.frame(id=6:10, y=factor(1:5), x=1:5)
> df1$z <- NULL     # Remove the z variable from data frame df1
> rbind(df1, df2)   # Now we can join the data frames
  id  x y
1  1 11 A
2  2 12 B
3  3 13 C
4  6  1 1
5  7  2 2
6  8  3 3
7  9  4 4
8 10  5 5
> df1 <- data.frame(id=1:3, x=11:13, y=factor(c("A", "B", "C")),
+                   z=1:3)
> df2 <- data.frame(id=6:10, y=factor(1:5), x=1:5)
> df2$z <- NA       # Add a vector z of NAs to data frame df2
> rbind(df1, df2)   # Now we can join the data frames
  id  x y  z
1  1 11 A  1
2  2 12 B  2
3  3 13 C  3
4  6  1 1 NA
5  7  2 2 NA
6  8  3 3 NA
7  9  4 4 NA
8 10  5 5 NA
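
The second approach can be wrapped in a small helper so that the
missing variables do not have to be identified by hand. This is only a
sketch: the function name rbind_fill is our own (hypothetical) and is
not a base R function.

```r
# Hypothetical helper: add the variables missing from either data frame
# as NA columns and then join the rows
rbind_fill <- function(a, b) {
  for (v in setdiff(names(b), names(a))) a[[v]] <- NA
  for (v in setdiff(names(a), names(b))) b[[v]] <- NA
  rbind(a, b)
}

df1 <- data.frame(id=1:3, x=11:13, y=factor(c("A", "B", "C")), z=1:3)
df2 <- data.frame(id=6:10, y=factor(1:5), x=1:5)
combined <- rbind_fill(df1, df2)
combined
```

The helper relies on rbind matching the variables by name, so the
order of the columns in the two data frames does not matter.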


2.14 Merge datasets 

Problem: You want to merge two data frames that have the same cases
but contain different variables.

Solution: You have two data frames on the same set of individuals
and want to merge them "horizontally" by combining the columns
of the two datasets. This can be done with the merge function. In most
cases, you want to merge the two data frames by one or more common
key variables to make sure they are matched correctly. The matching
is controlled by the optional by argument, which defaults to the
columns with names that are present in both data frames.

> dfx <- data.frame(id=factor(1:5), var1=11:15,
+                   fac1=factor(c("A", "B", "C", "D", "E")))
> dfy <- data.frame(id=factor(6:1), var2=15:10)
> merge(dfx, dfy, by="id")
  id var1 fac1 var2
1  1   11    A   10
2  2   12    B   11
3  3   13    C   12
4  4   14    D   13
5  5   15    E   14


Note how the unmatched observation from dfy with id=6 is discarded
from the combined data frame since the id does not appear in both
data frames. If we wish to keep rows that only appear in one of the two
data frames, we should use the all.x=TRUE or all.y=TRUE options. The
two options ensure that all rows from the first or second data frame,
respectively, are kept in the combined data frame. Rows that only
appear in one of the two data frames have their remaining variables
set to NA.

> merge(dfx, dfy, by="id", all.y=TRUE)
  id var1 fac1 var2
1  1   11    A   10
2  2   12    B   11
3  3   13    C   12
4  4   14    D   13
5  5   15    E   14
6  6   NA <NA>   15
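
Two further options to merge are worth a short sketch: by.x and by.y
handle key variables that have different names in the two data frames,
and all=TRUE keeps the unmatched rows from both. The data below are
made up, and the key name pid is our own choice.

```r
dfx <- data.frame(id=1:5, var1=11:15)
dfy <- data.frame(pid=6:2, var2=15:11)

# Keys have different names: match dfx$id against dfy$pid.
# all=TRUE keeps unmatched rows from both data frames
both <- merge(dfx, dfy, by.x="id", by.y="pid", all=TRUE)
both
```

The key column in the result takes its name from the first data frame,
and the unmatched rows (id 1 and id 6) have NA for the variables from
the other data frame.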


2.15 Stack the columns of a data frame together 

Problem: You have a data frame and want to stack the individual 
columns together to form a new data frame. 

Solution: Observations from different groups, situations or conditions 
are sometimes stored as separate variables in a data frame. 

The stack function takes the individual columns and transforms them
into a new data frame with two columns: one that contains the original
values and one that contains the group (i.e., the column name from
the original data frame). The new data frame can be used directly in
an analysis of variance model or other linear model, where data are
required to be of this form.

The stacked data frame has exactly two variables named values and 
ind which contain the values and groups, respectively. Below we show 
how stack is used to transform the data so we can analyze them using 
a linear model. 

> df <- data.frame(x=c(1, 2, 3, 4, 5), y=c(3, NA, 6, 7, 8))
> df
  x  y
1 1  3
2 2 NA
3 3  6
4 4  7
5 5  8
> newdf <- stack(df)
> newdf
   values ind
1       1   x
2       2   x
3       3   x
4       4   x
5       5   x
6       3   y
7      NA   y
8       6   y
9       7   y
10      8   y
> t.test(df$x, df$y, var.equal=TRUE)   # T-test for two samples

        Two Sample t-test

data:  df$x and df$y
t = -2.4152, df = 7, p-value = 0.04642
alternative hypothesis: true difference in means
 is not equal to 0
95 percent confidence interval:
 -5.93714236 -0.06285764
sample estimates:
mean of x mean of y
        3         6

> summary(lm(values ~ ind, data=newdf))   # One-way anova

Call:
lm(formula = values ~ ind, data = newdf)

Residuals:
   Min     1Q Median     3Q    Max
    -3     -1      0      1      2

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0000     0.8281   3.623  0.00848 **
indy          3.0000     1.2421   2.415  0.04642 *

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.852 on 7 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.4545,  Adjusted R-squared: 0.3766
F-statistic: 5.833 on 1 and 7 DF,  p-value: 0.04642

See also: unstack reverses the stack operation. 
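
A minimal sketch of the round trip, using the same data as above (the
object name back is our own):

```r
df <- data.frame(x=c(1, 2, 3, 4, 5), y=c(3, NA, 6, 7, 8))
newdf <- stack(df)

# unstack splits values by ind and rebuilds one column per group
back <- unstack(newdf)
back
```

Here both groups contain five values (the NA counts as a value), so the
result is a data frame. If the groups had different lengths, unstack
would return a list instead.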


2.16 Reshape a data frame from wide to long format or 
vice versa 

Problem: You have a data frame that is in "wide" format and want 
to transform it to the "long" format usually required by R functions or 
vice versa. 

Solution: Sometimes data are stored in a compact format that gives
all the covariates or explanatory variables for each subject followed by
all the observations on that subject. This setup is quite common, for
example, for longitudinal data, where there are repeated measurements
over time on each individual.

The "wide" format has one row for each individual with some
time-constant variables that occupy single columns and some
time-varying variables that occupy a column for each time point. Many
of R's functions require data frames to be in the "long" format where
there is only one observation/record on each row. Thus, in the "long"
format each individual may result in multiple records/rows and
time-constant variables will be constant across these records, while
the time-varying variables will change across the records. In the
presence of time-varying variables, the "long" format dataset needs an
"id" variable to identify which records refer to the same individual
and a "time" variable to show which time point each record comes from.
If there are no time-varying covariates, the two formats will be
identical.

Consider the following dataset in "wide" format where we have
observations on five individuals (indicated by the person column),
time-constant covariates age and gender, and time-varying observations
y1, ..., y4 that represent our variable of interest measured at 4
different time points. t1, ..., t4 indicate the actual times (days)
when the individuals were measured, and we can see that this differs
between individuals. We would like to convert this dataset to the
"long" format.


person age gender  t1 t2 t3 t4 y1 y2 y3 y4
I       20 Male     0  1  2  3 10 12 15 16
II      32 Male     0  2  4  5 10 12 15 16
III     20 Female   0  1  2  3 10 NA 15 16
IV      20 Male     0  1  2  3 10 12 15 16
V       20 Male     0  1 NA  3 10 12 15 16


The reshape function transforms data frames between the "wide" and
"long" formats. In order to transform a data frame from "wide" to
"long" format, we should at least provide the varying and
direction="long" options besides the original data frame when
calling reshape. varying names the set(s) of time-varying variables
that correspond to a single variable in the long format. The varying
option should either be a single vector of column names (or indices)
if there is only one time-varying variable, or a list of vectors of
variable names (or a matrix of names) if there are multiple
time-varying variables. Two other often-used options when converting
from "wide" to "long" format are idvar and v.names. idvar sets the
name of the covariate that identifies which observations belong to the
same individual in the "long" format, and the default name is id unless
something else is specified. v.names can be used to provide a vector of
variable names in the long format that correspond to multiple columns
in the "wide" format. If nothing is specified, then R will try to guess
variable names in the long format.

In the code below we assume that the data frame listed above is stored
in a text file named widedata.txt. We wish to transform the "wide"
data to the "long" format and make sure that t1, ..., t4 are converted
to a time-varying covariate named day and that y1, ..., y4 are
converted to a time-varying covariate entitled response.

> indata <- read.table("widedata.txt", header=TRUE)
> indata
  person age gender t1 t2 t3 t4 y1 y2 y3 y4
1      I  20   Male  0  1  2  3 10 12 15 16
2     II  32   Male  0  2  4  5 10 12 15 16
3    III  20 Female  0  1  2  3 10 NA 15 16
4     IV  20   Male  0  1  2  3 10 12 15 16
5      V  20   Male  0  1 NA  3 10 12 15 16

> long <- reshape(indata,
+                 varying=list(c("y1", "y2", "y3", "y4"),
+                              c("t1", "t2", "t3", "t4")),
+                 v.names=c("response", "day"),
+                 direction="long")
> long
    person age gender time response day id
1.1      I  20   Male    1       10   0  1
2.1     II  32   Male    1       10   0  2
3.1    III  20 Female    1       10   0  3
4.1     IV  20   Male    1       10   0  4
5.1      V  20   Male    1       10   0  5
1.2      I  20   Male    2       12   1  1
2.2     II  32   Male    2       12   2  2
3.2    III  20 Female    2       NA   1  3
4.2     IV  20   Male    2       12   1  4
5.2      V  20   Male    2       12   1  5
1.3      I  20   Male    3       15   2  1
2.3     II  32   Male    3       15   4  2
3.3    III  20 Female    3       15   2  3
4.3     IV  20   Male    3       15   2  4
5.3      V  20   Male    3       15  NA  5
1.4      I  20   Male    4       16   3  1
2.4     II  32   Male    4       16   5  2
3.4    III  20 Female    4       16   3  3
4.4     IV  20   Male    4       16   3  4
5.4      V  20   Male    4       16   3  5


The long data frame now contains the same data in the "long" format
and it can be used directly with most of R's modelling functions.
Apart from the time-constant covariates from the original dataset and
the new time-varying variables, reshape has created two additional
covariates: id and time.

We could have set idvar="person" to prevent reshape from creating
a new variable, id, since we already had a variable in the dataset
that could be used to identify individuals. In the long data frame we
see that the ys from the original dataset are combined in the vector
response and the ts are combined in the variable day, while time simply
numbers the measurement occasions.

The direction="wide" option should be set if we wish to convert a
dataset from "long" to "wide" format. When converting to the "wide"
format, it is important to specify both the idvar and the timevar
options so R knows how to identify observations from a single
individual and how to order the time-varying measurements for each
individual. Unless time-varying variables are specified with the
v.names option, reshape assumes that all variables (except those used
with idvar and timevar) are time-varying.

> reshape(long, idvar="person", timevar="time",
+         v.names=c("day", "response"),
+         direction="wide")
    person age gender id day.1 response.1 day.2 response.2
1.1      I  20   Male  1     0         10     1         12
2.1     II  32   Male  2     0         10     2         12
3.1    III  20 Female  3     0         10     1         NA
4.1     IV  20   Male  4     0         10     1         12
5.1      V  20   Male  5     0         10     1         12
    day.3 response.3 day.4 response.4
1.1     2         15     3         16
2.1     4         15     5         16
3.1     2         15     3         16
4.1     2         15     3         16
5.1    NA         15     3         16

> # Reshape but specify names of variables in wide format.
> # The vectors in varying are listed in the same order as v.names
> reshape(long, idvar="person", timevar="time",
+         v.names=c("day", "response"),
+         varying=list(c("t1", "t2", "t3", "t4"),
+                      c("y1", "y2", "y3", "y4")),
+         direction="wide")
    person age gender id t1 y1 t2 y2 t3 y3 t4 y4
1.1      I  20   Male  1  0 10  1 12  2 15  3 16
2.1     II  32   Male  2  0 10  2 12  4 15  5 16
3.1    III  20 Female  3  0 10  1 NA  2 15  3 16
4.1     IV  20   Male  4  0 10  1 12  2 15  3 16
5.1      V  20   Male  5  0 10  1 12 NA 15  3 16


The output above shows that the timevar variable used to define the 
time-varying measurements from each individual is removed from the 
dataset since measurements (and their order) are now all gathered on a 
single line for each individual. 
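
For readers who want to try the conversion without creating the file
widedata.txt, the following sketch reads the same data from a character
string and also sets idvar="person", as discussed above. Reading from
a string via the text argument is our own substitution for the file.

```r
# The wide data from the text, read from a string instead of a file
txt <- "person age gender t1 t2 t3 t4 y1 y2 y3 y4
I    20 Male    0 1 2  3 10 12 15 16
II   32 Male    0 2 4  5 10 12 15 16
III  20 Female  0 1 2  3 10 NA 15 16
IV   20 Male    0 1 2  3 10 12 15 16
V    20 Male    0 1 NA 3 10 12 15 16"
indata <- read.table(text=txt, header=TRUE)

# idvar="person" reuses the existing identifier, so no id column is added
long <- reshape(indata,
                varying=list(c("y1", "y2", "y3", "y4"),
                             c("t1", "t2", "t3", "t4")),
                v.names=c("response", "day"),
                idvar="person",
                direction="long")
head(long)
```

With idvar="person" the long data frame has one variable fewer than in
the example above, and the row names are built from the person labels
rather than from row numbers.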


2.17 Create a table of counts 

Problem: You wish to create a contingency table of counts by
cross-classifying one or more factors.

Solution: The table function is used to cross-classify and count the 
number of observations for each classification category. As input, the 
table function takes one or more objects that can be interpreted as 
factors and then calculates the counts for each combination of factor 
levels. 

> x <- rbinom(20, size=4, p=.5)
> x
 [1] 1 3 2 3 4 4 3 1 1 2 2 2 2 0 4 1 0 1 1 2
> table(x)
x
0 1 2 3 4
2 6 6 3 3
> sex <- sample(c("Male", "Female"), 20, replace=TRUE)
> sex
 [1] "Female" "Female" "Male"   "Female" "Male"   "Male"
 [7] "Male"   "Male"   "Male"   "Female" "Female" "Female"
[13] "Male"   "Male"   "Male"   "Female" "Male"   "Male"
[19] "Male"   "Female"
> table(sex, x)
        x
sex      0 1 2 3 4
  Female 0 2 4 2 0
  Male   2 4 2 1 3
> treatment <- rep(c("Placebo", "Active"), 10)
> treatment
 [1] "Placebo" "Active"  "Placebo" "Active"  "Placebo" "Active"
 [7] "Placebo" "Active"  "Placebo" "Active"  "Placebo" "Active"
[13] "Placebo" "Active"  "Placebo" "Active"  "Placebo" "Active"
[19] "Placebo" "Active"
> table(sex, x, treatment)
, , treatment = Active

   x
sex 0 1 2 3 4
  F 0 1 2 0 0
  M 1 2 1 2 1

, , treatment = Placebo

   x
sex 0 1 2 3 4
  F 0 1 3 1 1
  M 1 2 0 0 1


For tables with more than two dimensions, R prints a series of
two-dimensional tables. In the last example above, we have a
three-dimensional table so the relevant two-dimensional table, sex by
x, is printed for every combination of values of the remaining
dimensions. Multi-way tables are more easily printed using the ftable
function. The Titanic dataset contains information on survival of 2201
passengers on board the Titanic and the data are classified according
to economic status (class), sex, and age group (child or adult). This
four-dimensional array can be very compactly printed using ftable.


> data(Titanic)
> ftable(Titanic)
                   Survived  No Yes
Class Sex    Age
1st   Male   Child            0   5
             Adult          118  57
      Female Child            0   1
             Adult            4 140
2nd   Male   Child            0  11
             Adult          154  14
      Female Child            0  13
             Adult           13  80
3rd   Male   Child           35  13
             Adult          387  75
      Female Child           17  14
             Adult           89  76
Crew  Male   Child            0   0
             Adult          670 192
      Female Child            0   0
             Adult            3  20


See also: The margin.table function can be used to compute the sum
of table entries (e.g., row or column sums), while prop.table can be
used to compute entries in a contingency table as fractions of a table
margin. The xtabs function can create contingency tables from factors
using a formula interface.
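
The three functions can be illustrated with a small sketch. The data
are made up (and fixed rather than simulated, so the output is
reproducible); the object names tab, d and xt are our own.

```r
sex <- c("Female", "Male", "Male", "Female", "Male", "Female")
treatment <- c("Placebo", "Active", "Placebo", "Active", "Active", "Placebo")
tab <- table(sex, treatment)

margin.table(tab, 1)   # Row sums: number of observations per sex
prop.table(tab, 1)     # Each row as fractions of its row total

# The same table through the formula interface
d <- data.frame(sex, treatment)
xt <- xtabs(~ sex + treatment, data=d)
xt
```

The second argument to margin.table and prop.table selects the margin:
1 for rows and 2 for columns.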


2.18 Convert a table of counts to a data frame 

Problem: You have a contingency table of counts and wish to convert 
the counts to a data frame. 

Solution: Sometimes data are provided as a table of counts and it may
be necessary to convert the counts to a data frame. The as.data.frame
function converts a table representing the cross-tabulation of two or
more categorical variables to a data frame where each row in the data
frame represents a particular combination of categories (shown by a
column for each dimension in the table) together with the frequency of
that combination.

The HairEyeColor dataset consists of a cross-classification table of 
hair color, eye color and sex for 592 U.S. statistics students. 

> # Only look at males to keep output short
> maletable <- as.table(HairEyeColor[,,1])
> maletable
       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8
> freq <- as.data.frame(maletable)
> freq
    Hair   Eye Freq
1  Black Brown   32
2  Brown Brown   53
3    Red Brown   10
4  Blond Brown    3
5  Black  Blue   11
6  Brown  Blue   50
7    Red  Blue   10
8  Blond  Blue   30
9  Black Hazel   10
10 Brown Hazel   25
11   Red Hazel    7
12 Blond Hazel    5
13 Black Green    3
14 Brown Green   15
15   Red Green    7
16 Blond Green    8

The expand.table function from the epitools package expands the
contingency table into a data frame where each line corresponds to a
single observation. As input, expand.table requires a table or array,
x, which has both names(dimnames(x)) and dimnames(x) since these
are converted to field names and factor levels, respectively.

> library(epitools)
> fulldata <- expand.table(HairEyeColor)
> fulldata
     Hair   Eye    Sex
1   Black Brown   Male
2   Black Brown   Male
3   Black Brown   Male
4   Black Brown   Male
5   Black Brown   Male
6   Black Brown   Male
[ More data lines here ]
584 Blond Green   Male
585 Blond Green Female
586 Blond Green Female
587 Blond Green Female
588 Blond Green Female
589 Blond Green Female
590 Blond Green Female
591 Blond Green Female
592 Blond Green Female
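
If epitools is not installed, a similar expansion can be sketched in
base R by repeating each row of the frequency data frame as many times
as its count. This is our own alternative and not how expand.table is
implemented; the object name ind is hypothetical.

```r
maletable <- as.table(HairEyeColor[, , 1])
freq <- as.data.frame(maletable)

# Repeat each row index Freq times to get one row per individual
ind <- freq[rep(seq_len(nrow(freq)), freq$Freq), c("Hair", "Eye")]
rownames(ind) <- NULL
nrow(ind)

# Cross-tabulating the expanded data recovers the original counts
all(table(ind$Hair, ind$Eye) == maletable)
```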


2.19 Convert a data frame to a vector 

Problem: You have a data frame or matrix of values and want to
convert all the values into a single vector.

Solution: You have a data frame or a matrix of observations and wish to
combine all the observations into a single vector. Data type conversion
from data frames to vectors can be accomplished by first using
as.matrix to convert the data frame to a matrix and then using
as.vector to convert the matrix into a vector.

If all variables in the original data frame are numeric, then the resulting 
vector will be numeric. Otherwise, the result will be a character vector 
as shown in the following example. 

> df <- data.frame(a=rep(1,5), b=rep(2,5), c=rep(3,5))
> df
  a b c
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
> vector <- as.vector(as.matrix(df))   # Convert to vector
> vector
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
> df <- data.frame(a=rep(1,5),
+                  b=factor(c("A", "A", "A", "B", "B")))
> df          # Show data frame with both numeric and factor
  a b
1 1 A
2 1 A
3 1 A
4 1 B
5 1 B
> as.vector(as.matrix(df))   # Result is vector of strings
 [1] "1" "1" "1" "1" "1" "A" "A" "A" "B" "B"
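
For a data frame where all variables are numeric, unlist offers a
shorter route to the same vector. The sketch below only covers that
all-numeric case; the object name v is our own.

```r
df <- data.frame(a=rep(1, 5), b=rep(2, 5), c=rep(3, 5))

# unlist concatenates the columns; use.names=FALSE drops the
# generated element names (a1, a2, ...)
v <- unlist(df, use.names=FALSE)
v
identical(v, as.vector(as.matrix(df)))
```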


Factors 


2.20 Convert a factor to numeric 

Problem: You want to convert a factor to a vector of numeric values 
based on the factor labels. 

Solution: It is tempting to use as.numeric(myfactor) to convert
myfactor to a numeric vector. However, as.numeric(myfactor)
returns a numeric vector of the index of the levels and not the actual
values. Instead we need to use the levels function to return the actual
labels and then convert these. This also ensures that R automatically
converts non-numeric values to NAs and prints a warning message if
any of the levels cannot be converted to a numeric value.

> f <- factor(c(1:4, "A", 8, NA, "B")) 

> f 

[1] 1 2 3 4 A 8 <NA> B 

Levels: 1 2 3 4 8 A B 

> vect <- as.numeric(levels(f))[f] 

Warning message: 

NAs introduced by coercion 

> vect 

[1] 1 2 3 4 NA 8 NA NA 

> as.numeric(f) # Not correct 

[1] 1 2 3 4 6 5 NA 7 
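
A common alternative is to go through as.character. The sketch below
compares the two approaches; suppressWarnings is used only to keep the
output short, and the names v1 and v2 are our own.

```r
f <- factor(c(1:4, "A", 8, NA, "B"))

# Convert the labels of every observation, then coerce to numeric
v1 <- suppressWarnings(as.numeric(as.character(f)))
# Convert the (few) levels once and index by the factor
v2 <- suppressWarnings(as.numeric(levels(f))[f])
identical(v1, v2)
```

Both give the same result. The version based on levels only has to
convert each level once, which can matter for long factors with few
levels.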


2.21 Add a new level to an existing factor 

Problem: You want to add a new factor level to an existing factor. 

Solution: Sometimes it is necessary to add a new level to an existing
factor; for example, if an existing level is split up or because new
observations that contain levels not present in the existing factor are
introduced.

It is not possible to add a new category directly just by assigning the 
new value to the relevant observations. Instead it is necessary to first 
redefine the factor with the additional category and then assign the new 
values to the relevant observations afterwards. 

In the example below we wish to convert the first two observations to 
a new level called new level. 

> f <- factor(c(1, 2, 3, 2, 3, 1, 3, 1, 2, 1))
> f
 [1] 1 2 3 2 3 1 3 1 2 1
Levels: 1 2 3
> f[c(1,2)] <- "new level"     # Not working
Warning message:
In `[<-.factor`(`*tmp*`, c(1, 2), value = "new level") :
  invalid factor level, NAs generated
> f
 [1] <NA> <NA> 3    2    3    1    3    1    2    1
Levels: 1 2 3
> # Define the new factor level first
> f <- factor(f, levels=c(levels(f), "new level"))
> f[c(1,2)] <- "new level"     # Can now assign new levels
> f
 [1] new level new level 3         2         3         1
 [7] 3         1         2         1
Levels: 1 2 3 new level
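
As a sketch of an equivalent route, the new level can also be appended
through the levels replacement function before any values are assigned,
which avoids the intermediate NAs seen above.

```r
f <- factor(c(1, 2, 3, 2, 3, 1, 3, 1, 2, 1))

# Append the new level to the existing ones, then assign it
levels(f) <- c(levels(f), "new level")
f[c(1, 2)] <- "new level"
table(f)
```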


2.22 Combine the levels of a factor 

Problem: You want to group some of the levels of an existing factor to 
create a new factor with fewer levels. 

Solution: Use the levels function to set the new levels of the original
factor.

> f <- factor(c(1, 1, 2, 3, 2, 1, "A", 3, 3, 3, "A"))
> f
 [1] 1 1 2 3 2 1 A 3 3 3 A
Levels: 1 2 3 A
> # Define the new factor levels. We wish to combine
> # levels 1 and 3 so assign appropriate labels for
> # each of the original levels of the factor f
> new.levels <- c("13", "2", "13", "A")
> levels(f) <- new.levels
> f
 [1] 13 13 2  13 2  13 A  13 13 13 A
Levels: 13 2 A
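
The levels replacement function also accepts a named list, which makes
the grouping explicit and does not depend on the order of the original
levels. A brief sketch with the same factor:

```r
f <- factor(c(1, 1, 2, 3, 2, 1, "A", 3, 3, 3, "A"))

# Each list element names a new level and lists the old levels it absorbs
levels(f) <- list("13"=c("1", "3"), "2"="2", "A"="A")
f
```

Note that with the list form, any original level that is not mentioned
in the list is converted to NA.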


2.23 Remove unused levels of a factor 

Problem: You want to remove the levels of a factor that do not occur 
in the dataset. 

Solution: Use the droplevels function on an existing factor to create
a new factor that only contains the levels that are present in the
original factor.

> f <- factor(c(1, 2, 3, 2, 3, 2, 1), levels=c(0, 1, 2, 3))
> f
[1] 1 2 3 2 3 2 1
Levels: 0 1 2 3
> ff <- droplevels(f)     # Drop unused levels from f
> ff
[1] 1 2 3 2 3 2 1
Levels: 1 2 3

The droplevels function can also be used on data frames, in which 
case the function is applied to all factors in the data frame. The except 
option takes a vector of indices representing columns where unused 
factor levels are not to be dropped. 


> # Create data frame with two identical factors
> df <- data.frame(f1=f, f2=f)
> # Drop levels from all factors in df except variable 2
> newdf <- droplevels(df, except=2)
> newdf$f1
[1] 1 2 3 2 3 2 1
Levels: 1 2 3
> newdf$f2
[1] 1 2 3 2 3 2 1
Levels: 0 1 2 3

See also: Alternatively, the factor function can be applied to an
existing factor to create a new factor that only contains the levels
that are present in the original factor.
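
Unused levels typically arise after subsetting (see Rule 2.10). The
following sketch shows the situation for the ToothGrowth data; the
object name vc is our own.

```r
data(ToothGrowth)
vc <- subset(ToothGrowth, supp == "VC")
levels(vc$supp)      # The unused level OJ is still present
table(vc$supp)       # ... and shows up as an empty cell

vc <- droplevels(vc) # Drop unused levels from all factors in vc
levels(vc$supp)
```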


2.24 Cut a numeric vector into a factor 

Problem: You want to divide a numeric vector into intervals and then 
convert the intervals into a factor. 

Solution: cut divides the range of a numeric vector into intervals 
which can be converted to a factor. If we want to divide the range of a 
numeric vector into n intervals of the same length, we just include n as 
the second argument to cut. 

> x <- (1:15)**2 


> x

 [1]   1   4   9  16  25  36  49  64  81 100 121 144 169
[14] 196 225

> cut(x, 3) # Cut the range of x into 3 intervals of same length

 [1] (0.776,75.6] (0.776,75.6] (0.776,75.6] (0.776,75.6]
 [5] (0.776,75.6] (0.776,75.6] (0.776,75.6] (0.776,75.6]
 [9] (75.6,150]   (75.6,150]   (75.6,150]   (75.6,150]
[13] (150,225]    (150,225]    (150,225]
Levels: (0.776,75.6] (75.6,150] (150,225]

If we wish to make categories of roughly the same size, we should divide
the numeric vector based on its quantiles. In that case, we can use
the quantile function, as shown:

> # Cut the range into 3 intervals of roughly the same size 

> cut(x, quantile(x, probs=seq(0, 1, length=4)), 

+ include.lowest=TRUE) 


 [1] [1,32.3]   [1,32.3]   [1,32.3]   [1,32.3]   [1,32.3]
 [6] (32.3,107] (32.3,107] (32.3,107] (32.3,107] (32.3,107]
[11] (107,225]  (107,225]  (107,225]  (107,225]  (107,225]
Levels: [1,32.3] (32.3,107] (107,225]
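When the cut points are known in advance, they can be passed directly as a vector of breaks, and the labels option gives the resulting levels readable names. This is a sketch; the break points 50 and 150 and the labels are arbitrary choices made for illustration.

```r
x <- (1:15)^2
# Intervals are (0,50], (50,150] and (150,225]
grp <- cut(x, breaks = c(0, 50, 150, 225),
           labels = c("low", "mid", "high"))
table(grp)    # number of observations in each interval
```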




Transforming variables 


2.25 Sort data 

Problem: You want to sort a vector or data frame. 

Solution: sort sorts a numeric, complex, character, or logical vector 
in ascending order. Set the option decreasing to TRUE if it should be 
sorted in decreasing order. 

> x <- c(2, 4, 1, 5, 3, 4)

> x

[1] 2 4 1 5 3 4

> sort(x)

[1] 1 2 3 4 4 5

> sort(x, decreasing=TRUE)

[1] 5 4 4 3 2 1

> y <- c("b", "a", "A", "b", "b", "a") 

> y 

[1] "b" "a" "A" "b" "b" "a" 

> sort(y) 

[1] "a" "a" "A" "b" "b" "b" 

If you want to find the permutation that sorts a vector, you can
use the order function. order arranges its first argument in ascending
order (descending when the decreasing option is TRUE), and additional
arguments to order are subsequently used to break ties. Thus,
order can be used to sort data frames according to one or more
variables.

> order(x) 




[1] 3 1 5 2 6 4 

> df <- data.frame(x, y)

> df
  x y
1 2 b
2 4 a
3 1 A
4 5 b
5 3 b
6 4 a

> df[order(df$y),]
  x y
2 4 a
6 4 a
3 1 A
1 2 b
4 5 b
5 3 b

> df[order(df$y, df$x),]
  x y
2 4 a
6 4 a
3 1 A
1 2 b
5 3 b
4 5 b
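If one sorting variable should be ascending and another descending, a numeric variable can simply be negated inside order. The sketch below uses a small made-up data frame (the grouping variable g is not part of the example above).

```r
df <- data.frame(x = c(2, 4, 1, 5, 3, 4),
                 g = c(2, 1, 1, 2, 2, 1))
idx <- order(df$g, -df$x)   # ascending in g, descending in x within each g
df[idx, ]                   # remaining ties keep their original order
```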


2.26 Transform a variable 

Problem: You want to log transform (or use some other transformation)
a vector or a variable in a data frame.

Solution: It is often necessary to create a new variable based on the 
existing variable(s) or to transform a variable for example to change its 
scale. 

In R this is done simply by applying the function that creates the new 
variable from the existing variables. If the new variable is to be part of 
a data frame, then the $-operator can be used to store the new variable 
inside a data frame. 

Alternatively, the transform function can be used to create a new 
data frame from an existing data frame and at the same time define 
or redefine variables inside the new data frame. As its first argument 
transform takes the name of an existing data frame, and changed 
variables are put as additional arguments of the form tag=value as
shown in the following.

> data(airquality)

> names(airquality)

[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"

> # New variable

> Celsius <- (airquality$Temp-32)/1.8

> # Store the variable inside the data frame

> airquality$Celsius <- (airquality$Temp-32)/1.8

> head(airquality$Temp) 

[1] 67 72 74 62 56 66 

> head(airquality$Celsius) 

[1] 19.44444 22.22222 23.33333 16.66667 13.33333 18.88889 

> newdata <- transform(airquality, logOzone = log(Ozone)) 

> head(newdata)
  Ozone Solar.R Wind Temp Month Day  Celsius logOzone
1    41     190  7.4   67     5   1 19.44444 3.713572
2    36     118  8.0   72     5   2 22.22222 3.583519
3    12     149 12.6   74     5   3 23.33333 2.484907
4    18     313 11.5   62     5   4 16.66667 2.890372
5    NA      NA 14.3   56     5   5 13.33333       NA
6    28      NA 14.9   66     5   6 18.88889 3.332205
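Several variables can be defined in a single call to transform by giving several tag=value arguments. The following sketch starts from a fresh copy of airquality; the name aq is chosen here for illustration.

```r
data(airquality)
# Define two new variables in one call; the original data frame is unchanged
aq <- transform(airquality,
                Celsius  = (Temp - 32) / 1.8,
                logOzone = log(Ozone))
head(aq[, c("Temp", "Celsius", "Ozone", "logOzone")], 3)
```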


2.27 Apply a function multiple times to parts of a data 
frame or array 

Problem: You want to apply the same function repetitively on every 
row or column of a matrix or data frame, or on every element in a list. 

Solution: R has a range of functions that can be used to execute a 
function repetitively by applying the function to each cell or margin of 
an array or element of a list. This prevents us from having to write 
loops to perform some operation repeatedly, although the same result 
could be obtained with a loop. 

The apply function is very versatile, and it requires three arguments:
an array, a vector of the margin(s) over which the desired function
should be applied (the MARGIN argument), and the function that
should be executed repetitively (the FUN argument).

The MARGIN argument is used to specify which margin we want to apply
the function over and which margin we wish to keep. For example,
if the array we are using is two-dimensional then we can specify the
margin to be either 1 (apply the function to each of the rows) or 2
(apply the function to each of the columns).

The function defined by the option FUN can be a built-in or user-defined
function. The vector of values defined by the MARGIN option is passed
as the first argument to FUN, and if additional parameters are included
in the call to apply then they are passed on to the FUN function. As a
consequence, the first argument of the function FUN should accommodate
the correct input for apply to work.

> m <- matrix(c(NA, 2, 3, 4, 5, 6), ncol=2)

> m
     [,1] [,2]
[1,]   NA    4
[2,]    2    5
[3,]    3    6

> apply(m, 1, sum) # Compute the sum for each row 

[1] NA 7 9 
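The sketch below illustrates the remaining points from the description above: applying over columns with MARGIN=2, passing an extra argument on to FUN, and using a user-defined function. The result names are chosen here for illustration.

```r
m <- matrix(c(NA, 2, 3, 4, 5, 6), ncol = 2)
row_sums  <- apply(m, 1, sum, na.rm = TRUE)    # na.rm is passed on to sum
col_means <- apply(m, 2, mean, na.rm = TRUE)   # MARGIN=2: one value per column
# User-defined function: range width of each row, ignoring missing values
row_width <- apply(m, 1, function(r) max(r, na.rm = TRUE) - min(r, na.rm = TRUE))
```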

The lapply function should always be used when working on the variables
in a data frame. lapply applies a function over a list so it will
automatically work on every variable in the data frame and return the
result as a list. sapply is a user-friendly version of lapply that returns
its result as a vector or matrix if appropriate. lapply and sapply need
just two arguments: the data frame or list and the function to apply (still
given by the FUN argument).

> data(airquality) 

> sapply(airquality, mean) # Compute the mean of each variable 

Ozone Solar.R Wind Temp Month Day 

NA NA 9.957516 77.882353 6.993464 15.803922 

Additional options to the applied function can be specified in the calls
to apply, lapply, or sapply. For example, if we wish to disregard
missing observations and trim away the lower and upper 10% of observations
for each variable, then we write the extra options to the mean
function when calling sapply.

> sapply(airquality, mean, na.rm=TRUE, trim=.1)
     Ozone    Solar.R       Wind       Temp      Month
 37.797872 190.338983   9.869919  78.284553   6.991870
       Day
 15.804878

The by function applies a function to subsets of a data frame, and it
requires three arguments just like apply. The first argument should be
a data frame, the second argument should be a factor (or list of factors)
with the same length(s) as the number of rows in the data frame, and
the last required argument is the function that should be applied to each
subset of the data frame defined by the factor.

For example, to calculate the mean of each variable for each month we
write

> by(airquality, airquality$Month, mean)
airquality$Month: 5
    Ozone  Solar.R     Wind     Temp    Month      Day
       NA       NA 11.62258 65.54839  5.00000 16.00000


airquality$Month: 6 

Ozone Solar.R Wind Temp Month Day 

NA 190.16667 10.26667 79.10000 6.00000 15.50000 


airquality$Month: 7 

Ozone Solar.R Wind Temp Month 

NA 216.483871 8.941935 83.903226 7.000000 

Day 

16.000000 


airquality$Month: 8 

Ozone Solar.R Wind Temp Month Day 

NA NA 8.793548 83.967742 8.000000 16.000000 

airquality$Month: 9 

Ozone Solar.R Wind Temp Month Day 

NA 167.4333 10.1800 76.9000 9.0000 15.5000 

Here the mean of each variable is calculated for each month. Additional
options to the applied function can be specified in the call to by just as
for apply.

See also: The tapply function applies a function over a "ragged array".
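The following sketch shows tapply on a single variable split by a grouping variable. It also shows a variant of the by call for current R versions, where mean applied to a whole data frame returns NA with a warning; using colMeans on numeric columns is a substitution made here, not the call used above.

```r
data(airquality)
# tapply: one vector, split by a grouping variable
temp_by_month <- tapply(airquality$Temp, airquality$Month, mean)
temp_by_month
# by(): colMeans computes the mean of each (numeric) column of each subset
means_by_month <- by(airquality[, c("Wind", "Temp")],
                     airquality$Month, colMeans)
means_by_month
```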


2.28 Use a Box-Cox transformation to make non-normally
distributed data approximately normal

Problem: You want to find the optimal power transformation of a 
vector of positive values to ensure that the values of the transformed 
vector are approximately normal. 

Solution: The Box-Cox transformation is a family of power transformations
that can be used to transform non-normally distributed quantitative
data to approximately normally distributed data. The Box-Cox
transformation is defined as

$$
f(y) =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\log(y) & \text{if } \lambda = 0
\end{cases}
$$

where λ is the transformation parameter. λ-values of 0, 0.5 and 1
correspond to a logarithmic transformation, a square-root transformation
and no transformation, respectively.

Figure 2.1: Profile log-likelihoods for the parameter of the Box-Cox power
transformation for the trees data.
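The definition can be written directly as a small R function. This is only a sketch for applying the transformation once a value of λ has been chosen; the function name boxcox_transform is made up here and is not part of the MASS package.

```r
# Hypothetical helper: apply the Box-Cox transformation for a given lambda
boxcox_transform <- function(y, lambda) {
  stopifnot(all(y > 0))                  # defined for positive values only
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
boxcox_transform(c(1, 4, 9), 0.5)        # square-root type transformation
boxcox_transform(c(1, 4, 9), 0)          # logarithmic transformation
```

For values of λ close to zero the result approaches log(y), which is why the logarithm is used as the limiting case.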


Box-Cox transformations are often used in combination with linear models
when the residuals appear not to follow a normal distribution. In
those situations, the aim of the Box-Cox transformation is to identify the
optimal value of the power transformation parameter for the response
variable such that the transformation "best" normalizes the data and
ensures that the assumptions for the linear model hold.

The boxcox function from the MASS package can be used to apply the
Box-Cox transformation to linear models, and the input can be either a
formula or a previously fitted lm model object. The lambda option is
a numerical vector that sets the values of the transformation parameter,
λ, to examine. It defaults to the range from -2 to 2 in steps of 0.1.
plotit is a logical option that defaults to TRUE and it determines if the
profile log-likelihood is plotted for the parameter of the Box-Cox
transformation. boxcox returns a list of two vectors: x is a vector of λ
values and y is the corresponding vector of profile log-likelihood values.

We use the cherry tree data, trees, to find a good transformation when 
we model the tree volume as a function of log height and log girth. 

> library(MASS) 

> data(trees) 

> result <- boxcox(Volume ~ log(Height) + log(Girth), data=trees, 

+ lambda=seq(-0.5, 0.6, length=13)) 

> # Find the lambda value with highest profile log-likelihood 

> result$x[which.max(result$y)] 

[1] -0.06666667 

The highest profile log-likelihood is attained for λ = -0.06667 and we
see from the profile log-likelihood plot in Figure 2.1 that the 95%
confidence interval for the λ parameter includes zero. Thus we conclude
that a log transform of the response variable would be reasonable for
this model for the cherry tree data.

The Box-Cox transformation is defined only for positive data values.
This should not pose any problem because a constant can always be
added if the set of observations contains one or more negative values.
If you wish to find the Box-Cox transformation for a single vector of
values then use a linear model with no explanatory variables as
shown below, where we find that a square root transformation (λ = 0.5)
is not unreasonable.

> y <- rnorm(100) + 5 # Generate data

> y2 <- y**2 # Square the results 

> result <- boxcox(y2 ~ 1, plotit=FALSE) 

> result$x[which.max(result$y)] 

[1] 0.4242424 

See also: See Rules 3.23 and 4.18 for model validation of linear models. 
Rule 3.7 gives an example of linear models. 


2.29 Calculate the area under a curve 

Problem: You have a set of repeated measurements for each individual 
and wish to calculate the area under the curve defined by the set of 
measurements over time. 

Solution: Repeated measurements occur, for example, when measurements
are taken several times over time on a single individual and the
resulting set of observations is essentially a response profile. In some
situations it is desirable to summarize the set of repeated measurements
by a single value like the mean, the maximum value, the average increase
or the area under the response curve.

The area under the curve corresponds to the integral of the curve but
since we only have information for the observed time points we need
to approximate the underlying profile curve. If we assume that we can
approximate the profile with straight lines between time points, then
we can use the trapezoid rule to calculate the integral/area under the
curve:

$$
\int_a^b f(x)\,dx \approx \frac{1}{2} \sum_{i=2}^{n} (x_i - x_{i-1})(y_i + y_{i-1})
$$

We can make our own function to calculate the area under the curve
using the trapezoid rule as seen in the following code.

> trapezoid <- function(x, y) {

+ sum(diff(x)*(y[-1]+y[-length(y)]))/2 }

> x <- 1:4 

> y <- c(0, 1, 1, 5) 

> trapezoid(x, y) 

[1] 4.5 

> x2 <- c(3, 4, 2, 1) # Reorder the same data

> y2 <- c(1, 5, 1, 0)

> trapezoid(x2, y2) # We get the wrong result

[1] -3.5

While the trapezoid function works, it comes with several caveats:
if any value is missing, it will return NA, it computes the wrong result
when the x-values are not increasing, etc.

We can use the approxfun function together with integrate to make
a much more versatile function to calculate the area under the curve.
approxfun creates a function that performs the linear interpolation
between points and integrate calculates the numerical integral, and
combining these we can handle unsorted time values, missing observations,
ties for the time values, and integrating over part of the area or
even outside the area. The only thing it is lacking is a bit of input error
checking.

The rule option to approxfun is an integer that determines how
interpolation outside the observation range takes place. The default value
of 1 makes approxfun return NA outside the observational range.
If rule=2 then a combined "last observation carried forward" and "first
observation carried backward" approach is used. If different extrapolations
are needed for the left and right side of the observational range,
then we set rule to be an integer vector of length 2 with elements
corresponding to the extrapolation rule to the left and right, respectively.
The yleft and yright options to approxfun set the y-values that
are returned outside the observation range when rule=2 instead of the
"last observation carried forward" and "first observation carried
backward" values.
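The effect of the rule, yleft and yright options is perhaps easiest to see by evaluating the interpolating function itself. The sketch below uses the same x and y values as the trapezoid example; the names f1 to f4 are chosen for illustration.

```r
x <- 1:4
y <- c(0, 1, 1, 5)
f1 <- approxfun(x, y)                        # rule=1: NA outside [1, 4]
f2 <- approxfun(x, y, rule = 2)              # carry the end values outward
f3 <- approxfun(x, y, rule = c(1, 2))        # NA to the left, carry to the right
f4 <- approxfun(x, y, rule = 2, yleft = .5)  # fixed value 0.5 to the left
res <- c(f1(0), f2(0), f2(5), f3(0), f3(5), f4(0), f4(2.5))
res
```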

> auc <- function(x, y, from=min(x), to=max(x), ...) { 

+ integrate(approxfun(x,y, ...) , from, to)$value } 

> auc(x, y) 

[1] 4.5 

> auc(x2, y2) # Reordering is fine 

[1] 4.5 

> auc(x, y, from=0) # AUC from 0 to max(x) 

Error in integrate(approxfun(x, y, ...), from, to) : 

non-finite function value 

> auc(x, y, from=0, rule=2) # Allow extrapolation 

[1] 4.5 

> auc(x, y, from=0, rule=2, yleft=0) # Use value 0 to the left

[1] 4.5

> auc(x, y, from=0, rule=2, yleft=.5) # Use 1/2 to the left

[1] 5

If we wish to compute the area under the curve for each individual in
a data frame we can use the auc function together with the by function.
For example, to calculate the area under each growth curve for
the ChickWeight data we can use the following code.

> data(ChickWeight) 

> result <- by(ChickWeight, ChickWeight$Chick, 

+ function(df) { auc(df$Time, df$weight) } ) 

> head(result) 

ChickWeight$Chick 

18 16 15 13 9 20 

74.0000 601.0000 852.9996 1397.4999 1708.9992 1607.9972 

> # Make sure that each area is based on the same interval

> # from 0 to 21 days. Carry first and last observation

> # backward and forward, respectively

> result2 <- by(ChickWeight, ChickWeight$Chick, function(df) {

+ auc(df$Time, df$weight, from=0, to=21, rule=2)})

> head(result2) 

ChickWeight$Chick 

18 16 15 13 9 20 

739.000 1087.000 1328.998 1397.500 1708.999 1607.997 

See also: Use Rule 2.16 to change the data to long format before using
auc if they are originally in wide format. Alternatively, the appropriate
x and y vectors may be created directly from the data for data frames
in wide format.

Model formulas play a central role when specifying statistical models in
many of the R functions. The basic format for a formula is of the form
response ~ predictor(s). The ~ is read "is modeled by" so we
interpret the expression as a specification that the response is modeled
by a combination of the predictors. The predictor variable(s) can be
either numeric variables, factors or a combination of both.

Generally, terms are included in the model with the + operator which
indicates that the explanatory variables have an additive effect on the
response variable (possibly through a transformation). An intercept
term is implicitly included in R formulas and it can be dropped from
the model by including the term -1. Likewise, other terms can be
removed by the - operator, not just the intercept.

The code below uses the ChickWeight dataset to show examples of 
model formulas in combination with the linear model function, lm (the 
output from the calls below is not shown). The ChickWeight data 
frame contains information on the effect of diets on early growth of 
chickens. 

> data(ChickWeight)

> attach(ChickWeight)

> lm(weight ~ Time) # Linear regression

> lm(weight ~ Time - 1) # Linear regression through origin

> lm(weight ~ Diet) # One-way ANOVA

> lm(weight ~ Diet - 1) # One-way ANOVA without ref. group

> lm(weight ~ Diet + Chick) # Two-way additive ANOVA

> # Analysis of covariance, common slope, one intercept per diet

> lm(weight ~ Time + Diet)

Interactions can be specified in different ways in the model formula.
The *, :, and ^ operators all introduce interaction terms in the model,
but in different ways. var1:var2 adds an interaction between the two
terms. var1*var2 includes not only the interaction but also the main
effects, so var1*var2 is identical to var1 + var2 + var1:var2.
The formula (var1 + var2 + var3)^k includes all the main effects
and all possible interactions up to order k. For example, the formula
(var1 + var2 + var3)^3 corresponds to var1*var2*var3 while
the formula (var1 + var2 + var3)^2 corresponds to the formula
var1*var2*var3 - var1:var2:var3 which we also can write as
var1*var2 + var1*var3 + var2*var3. The nesting operator is
written as var1/var2 which means that we fit a model with a main
effect of var1 plus var2 within var1. This is equivalent to the model
var1 + var1:var2.

Table 3.1: Operators for model formulas

Operator  Effect in model formula
+         add the term to the model
-         leave out the term from the model
:         introduce interaction between the terms
*         introduce interaction and lower-order effects of the terms
^         add all terms including interactions up to a certain order
/         nest the terms on the right within those on the left

> lm(weight ~ Diet*Chick) # Two-way ANOVA with interaction 

> lm(weight ~ Diet + Chick + Diet:Chick) # Same as above 

> # Analysis of covariance, one slope and intercept for each diet 

> lm(weight ~ Time*Diet) 

> lm(weight ~ Diet/Chick) # Nested model 

> lm(weight ~ Diet + Diet:Chick) # Same as above 

> lm(weight ~ Diet*Chick - Chick) # Same as above 

A summary of operators for model formulas can be seen in Table 3.1. 
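A convenient way to check how R expands a formula is to inspect the term labels returned by the terms function, which works on a formula even when no data are available. The helper name expand below is made up for this sketch, and y, a, b and c are placeholder variable names.

```r
# Hypothetical helper: list the terms that a formula expands to
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ a * b)               # main effects and interaction
expand(y ~ (a + b + c)^2)       # all terms up to second order
expand(y ~ a * b * c - a:b:c)   # same terms as the line above
expand(y ~ a / b)               # b nested within a
attr(terms(y ~ a - 1), "intercept")   # 0 means the intercept is dropped
```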

The inhibit function, I, inhibits the interpretation of formula operators,
+, -, etc., so they can be used as arithmetical operators in the formulas.
Normal arithmetic expressions, for example logarithms or square-roots,
can be included in the formulas for both the response variable and
explanatory variables.

The pressure dataset contains data on the relation between temperature
in degrees Celsius and vapor pressure of mercury (in millimeters
of mercury). If we wish to model the logarithm of the mercury pressure
as a quadratic function of temperature measured in Fahrenheit, then we
write

> data(pressure)

> lm(log(pressure) ~ I(temperature*9/5+32) +

+ I((temperature*9/5+32)^2), data=pressure)

As shown in the code above, the response variable can have a function 
applied to it — in this case the natural logarithm. 
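To see why I is needed for the quadratic term, compare the terms that R extracts from a formula with and without it. In this sketch y and x are placeholder names, and the result names are chosen for illustration.

```r
# Without I(), ^ is a formula operator and x^2 collapses to the term x
without_I <- attr(terms(y ~ x + x^2), "term.labels")
# With I(), the square is kept as an arithmetic expression
with_I <- attr(terms(y ~ x + I(x^2)), "term.labels")
without_I
with_I
```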

The interaction function computes the factor identical to what is
obtained by specifying an interaction term in a model. We can use the
interaction function outside model formulas to see how the resulting
interaction factor looks. In the following code, two factors are used
to create a new factor that is the interaction between the two original
factors. We can set the argument drop=TRUE to interaction to remove
any unused factor levels for the interaction.

> f1 <- factor(c("A", "B", "C", "B", "A", "C"))

> f2 <- factor(c(1, 2, 1, 2, 1, 2))

> interaction(f1, f2)

[1] A.1 B.2 C.1 B.2 A.1 C.2
Levels: A.1 B.1 C.1 A.2 B.2 C.2

> interaction(f1, f2, drop=TRUE) # Drop unused factor levels

[1] A.1 B.2 C.1 B.2 A.1 C.2
Levels: A.1 C.1 B.2 C.2

The offset function is used inside model formulas to fix the estimated 
coefficient of a numeric explanatory variable at the value 1. 

> lm(weight ~ offset(Time)) # Fix the slope of time at 1 

Call: 

lm(formula = weight ~ offset(Time)) 

Coefficients: 

(Intercept) 

      111.1

> newy <- weight - Time # Subtract the effect of time 

> lm(newy ~ 1) # Same result as above 

Call: 

lm(formula = newy ~ 1) 

Coefficients: 

(Intercept) 

      111.1

If we wish to fix the regression parameter at another value than 1, then 
we simply multiply by the desired constant inside offset. 

> lm(weight ~ offset(2.7*Time)) # Fix slope at 2.7 

Call: 

lm(formula = weight ~ offset(2.7*Time)) 

Coefficients: 

(Intercept) 

92.88 
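As with the slope fixed at 1, the model with the slope fixed at 2.7 can be checked by hand: the only parameter left to estimate is the intercept, which is the mean of the response after the fixed effect of time has been subtracted. The sketch below uses data=ChickWeight instead of relying on attach; the names fit and by_hand are chosen for illustration.

```r
data(ChickWeight)
fit <- lm(weight ~ offset(2.7 * Time), data = ChickWeight)
# Intercept by hand: mean of the response minus the fixed time effect
by_hand <- mean(ChickWeight$weight - 2.7 * ChickWeight$Time)
all.equal(unname(coef(fit)), by_hand)
```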

The vertical bar, |, is used in some model functions (e.g., lme and
lmer) to indicate conditioning. See Rules 3.12 and 3.13 for examples.

Descriptive statistics


3.1 Create descriptive tables 

Problem: You want to compute descriptive statistics for a single numeric
vector or for all numeric variables in a data frame.

Solution: The stat.desc function from the pastecs package provides
information on the total number of observations, the number of
observations that are null (i.e., equal to zero), the number of missing
observations (NA), and calculates the minimum, maximum, range, sum,
median, mean, standard error of the mean, 95% confidence interval of the
mean, variance, standard deviation and coefficient of variation.

By default, stat.desc prints results in scientific notation but that can
be circumvented by using the options function to set a penalty for
printing numbers using scientific notation and to set the desired number
of digits. Non-numeric vectors are disregarded by the stat.desc
function as seen for the variable supp in the following example.


> library(pastecs) 

> options(scipen=100, digits=2) # Prevent scientific notation 

> data(ToothGrowth) 

> stat.desc(ToothGrowth)
                 len supp   dose
nbr.val        60.00   NA 60.000
nbr.null        0.00   NA  0.000
nbr.na          0.00   NA  0.000
min             4.20   NA  0.500
max            33.90   NA  2.000
range          29.70   NA  1.500
sum          1128.80   NA 70.000
median         19.25   NA  1.000
mean           18.81   NA  1.167
SE.mean         0.99   NA  0.081
CI.mean.0.95    1.98   NA  0.162
var            58.51   NA  0.395
std.dev         7.65   NA  0.629
coef.var        0.41   NA  0.539


If we are only interested in the descriptive statistics, such as the
median, mean and standard deviation, then we can add the option
basic=FALSE in the call to stat.desc. Likewise, if we only want
the basic statistics such as the number of observations and number of
missing values, then we can use desc=FALSE.

Often, descriptive statistics are needed for different groups. We can
calculate the descriptive statistics for each group with the by function.
The following code computes the descriptive statistics for each supplement
type.


> by(ToothGrowth, ToothGrowth$supp, stat.desc, basic=FALSE)
ToothGrowth$supp: OJ
               len supp dose
median       22.70   NA 1.00
mean         20.66   NA 1.17
SE.mean       1.21   NA 0.12
CI.mean.0.95  2.47   NA 0.24
var          43.63   NA 0.40
std.dev       6.61   NA 0.63
coef.var      0.32   NA 0.54

ToothGrowth$supp: VC
               len supp dose
median       16.50   NA 1.00
mean         16.96   NA 1.17
SE.mean       1.51   NA 0.12
CI.mean.0.95  3.09   NA 0.24
var          68.33   NA 0.40
std.dev       8.27   NA 0.63
coef.var      0.49   NA 0.54

stat.desc also has an option, norm=TRUE, to calculate normal distribution
statistics (i.e., the skewness, the skewness significance criterion,
the kurtosis, the kurtosis significance criterion, the Shapiro-Wilk
statistic for test of normality and its associated p-value).


> stat.desc(ToothGrowth, basic=FALSE, norm=TRUE)
               len supp        dose
median       19.25   NA  1.00000000
mean         18.81   NA  1.16666667
SE.mean       0.99   NA  0.08118705
CI.mean.0.95  1.98   NA  0.16245491
var          58.51   NA  0.39548023
std.dev       7.65   NA  0.62887219
coef.var      0.41   NA  0.53903330
skewness     -0.14   NA  0.37229661
skew.2SE     -0.23   NA  0.60301903
kurtosis     -1.04   NA -1.54958333
kurt.2SE     -0.86   NA -1.27329801
normtest.W    0.97   NA  0.76490449
normtest.p    0.11   NA  0.00000002

Linear models


3.2 Fit a linear regression model 

Problem: You want to fit a linear regression model to describe the 
linear relationship between two quantitative variables. 

Solution: Linear models (of which linear regression is a special case
with just a single quantitative explanatory variable) are fitted in R
using the function lm. lm expects a model formula of the form y ~ x
where y is the quantitative response variable and x is the quantitative
explanatory variable.

In the trees dataset, we seek to model the volume of the trees as a
linear function of the height. We can use the data option to specify a data
frame that contains the variables or just make sure that the variables are
accessible.

> data(trees) 

> result <- lm(Volume ~ Height, data=trees) 

> result 

Call: 

lm(formula = Volume ~ Height, data = trees) 

Coefficients: 

(Intercept) Height 

-87.124 1.543 

The fitted linear regression has two parameters, the intercept and slope,
and they are listed as (Intercept) and Height in the output, respectively.
The summary function can be used to get additional information
on the estimated parameters.

> summary(result) 

Call: 

lm(formula = Volume ~ Height, data = trees) 

Residuals: 

    Min      1Q  Median      3Q     Max
-21.274  -9.894  -2.894  12.067  29.852

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.1236    29.2731  -2.976 0.005835 **
Height        1.5433     0.3839   4.021 0.000378 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.4 on 29 degrees of freedom
Multiple R-squared: 0.3579, Adjusted R-squared: 0.3358
F-statistic: 16.16 on 1 and 29 DF, p-value: 0.0003784

Figure 3.1: Fit of a linear regression model to the cherry tree data.


From the summary output we find the estimates, their corresponding
standard errors and t-tests for the hypothesis that the parameter equals
zero. The slope is significantly different from zero (p = 0.000378), and
the volume increases on average by 1.5433 for each unit increase of
height. The residual standard error of 13.4 is the estimated standard
deviation of the residuals.
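The quantities discussed above can also be extracted from the fitted model object instead of being read off the printed summary. The sketch below uses the standard extractor functions coef, confint and predict; the height of 80 used for the prediction is an arbitrary value chosen for illustration.

```r
data(trees)
fit <- lm(Volume ~ Height, data = trees)
coef(fit)                       # estimated intercept and slope
confint(fit, level = 0.95)      # confidence intervals for both parameters
summary(fit)$sigma              # residual standard error
# Expected volume for a tree of height 80
pred <- predict(fit, newdata = data.frame(Height = 80))
pred
```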

We can use the abline function to add the fitted regression line to 
an existing plot by using the fitted model as argument. The output is 
shown in Figure 3.1. 


> plot(trees$Height, trees$Volume, # Make scatter plot

+ xlab="Height", ylab="Volume")

> abline(result) # Add fitted line


See also: Rules 3.23 and 4.18 show how to validate linear models.


3.3 Fit a multiple linear regression model

Problem: You want to fit a multiple linear regression model to describe
the linear relationship between a quantitative response variable
and several explanatory quantitative variables.

Solution: Multiple linear regression is an extension of linear regression
(see Rule 3.2) where we model a quantitative response as a function of
two or more quantitative explanatory variables.

The cherry tree dataset trees contains information not only on the tree
volume and height, but also on the tree girth (diameter at breast height).
We want to use both the height and girth, which are measures that are
easily acquired, to model the volume of the tree.

> data(trees) 

> result <- lm(Volume ~ Height + Girth, data=trees) 

> result 

Call: 

lm(formula = Volume ~ Height + Girth, data = trees) 

Coefficients: 

(Intercept) Height Girth 

-57.9877 0.3393 4.7082 

The summary function is used to extract estimates and their corresponding
standard errors.

> summary(result) 

Call: 

lm(formula = Volume ~ Height + Girth, data = trees) 

Residuals:
    Min      1Q  Median      3Q     Max
-6.4065 -2.6493 -0.2876  2.2003  8.4847

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Height        0.3393     0.1302   2.607   0.0145 *
Girth         4.7082     0.2643  17.816  < 2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16

The three parameters for the model are the intercept and two partial
regression slopes, and their estimates can all be found in the output
from summary: -57.9877 for the intercept, 0.3393 for height and 4.7082
for girth. We see that both height and girth are statistically significant
with p-values of 0.0145 and less than 2e-16, respectively. The residual
standard error of 3.882 is the estimated standard deviation of the
residuals.

See also: Rules 3.23 and 4.18 show how to make model validation for 
linear models. 


3.4 Fit a polynomial regression model 

Problem: You wish to fit a polynomial regression model where the 
relationship between the response variable and an explanatory variable 
is a higher-order polynomial. 

Solution: In some situations the functional relationship between the
response variable and a quantitative explanatory variable should be
modeled as a kth-order polynomial.

Polynomial regression is a special case of multiple linear regression.
Thus we can use Rule 3.3 and the lm function to fit a polynomial regression
model by treating the different powers of the explanatory variable
as distinct independent variables in a multiple regression model.
In the example we wish to model the number of deaths due to AIDS for
a 19-year period. We use the inhibit function, I, to ensure that the ^
operator is interpreted as exponentiation and not as a formula interaction.
Also, we subtract 1980 from the years to prevent numerical instability;
1999^3 is a very large number and if we analyze the data on the original
scale then R might run into numerical difficulties and be unable to
estimate all the model parameters.

> year <- seq(1981, 1999)
> deaths <- c(339, 1201, 3153, 6368, 12044, 19404, 29105,
+             36126, 43499, 49546, 60573, 79657, 79879,
+             73086, 69984, 61124, 49379, 43225, 41356)
> newyear <- year - 1980
> model <- lm(deaths ~ newyear + I(newyear^2) + I(newyear^3))
> summary(model)


Call:
lm(formula = deaths ~ newyear + I(newyear^2) + I(newyear^3))

Residuals:
    Min      1Q  Median      3Q     Max
-9983.7 -2102.0  -782.8   635.6 12592.4

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   3199.14    6783.11   0.472 0.643975
newyear      -4754.21    2862.05  -1.661 0.117442
I(newyear^2)  1717.37     328.10   5.234 0.000101 ***
I(newyear^3)   -73.14      10.80  -6.771  6.3e-06 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5985 on 15 degrees of freedom
Multiple R-squared: 0.9591, Adjusted R-squared: 0.9509
F-statistic: 117.3 on 3 and 15 DF, p-value: 1.232e-10

Thus we model the average number of AIDS-related deaths as the
third-degree polynomial

    y = -73.14x^3 + 1717.37x^2 - 4754.21x + 3199.14,

where y is the number of deaths and x is the number of years since
1980. The summary output also shows that the third-order coefficient
is highly significant (p < 0.0001) so a second-order polynomial would
provide a significantly worse fit.
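
The claim that the quadratic model fits significantly worse can be
checked with an explicit comparison of nested models. The sketch below
is one way to do that, assuming the same data as in the example; the
object names cubic, quadratic, cubic_raw and cmp are illustrative. It
also shows poly with raw=TRUE as an alternative way of writing the same
cubic model.

```r
year    <- 1981:1999
deaths  <- c(339, 1201, 3153, 6368, 12044, 19404, 29105,
             36126, 43499, 49546, 60573, 79657, 79879,
             73086, 69984, 61124, 49379, 43225, 41356)
newyear <- year - 1980

# Cubic and quadratic fits written with I()
cubic     <- lm(deaths ~ newyear + I(newyear^2) + I(newyear^3))
quadratic <- lm(deaths ~ newyear + I(newyear^2))

# Same cubic model written with raw (non-orthogonal) polynomial terms
cubic_raw <- lm(deaths ~ poly(newyear, 3, raw = TRUE))

# F test for the nested comparison; with one extra term the
# F statistic equals the squared t value of the cubic coefficient
cmp <- anova(quadratic, cubic)
print(cmp)
```

With only one term removed, the F test and the t test for the cubic
coefficient give the same p-value, so either can be reported.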

We can plot the data and the predicted relationship using the predict 
function as shown below. 

> # Plot the data and add the fitted cubic line
> plot(year, deaths)
> lines(seq(1981, 1999, .1),
+       predict(model, data.frame(newyear=seq(1, 19, .1))))

The data and the fitted third-degree polynomial can be seen in
Figure 3.2.


3.5 Fit a one-way analysis of variance 

Problem: You wish to perform a one-way analysis of variance to test if 
the means from several groups are equal. 

Solution: One-way analysis of variance is used to test if the means
of two or more groups are equal. Since one-way analysis of variance
is a special case of a linear model where the explanatory variable is a
categorical factor, we can use the lm function for the analysis; we
just need to make sure that the explanatory variable is coded as a
factor.

Figure 3.2: Fit of a third-degree polynomial to AIDS deaths data.

The OrchardSprays dataset provides information on eight different
lime sulphur concentrations and their potency in repelling honey bees.
We wish to analyze if there is a difference in potency among the eight
treatments.


> data(OrchardSprays)
> head(OrchardSprays)
  decrease rowpos colpos treatment
1       57      1      1         D
2       95      2      1         E
3        8      3      1         B
4       69      4      1         H
5       92      5      1         G
6       90      6      1         F
> model <- lm(decrease ~ treatment,
+             data=OrchardSprays)
> model

Call:
lm(formula = decrease ~ treatment, data = OrchardSprays)

Coefficients:
(Intercept)   treatmentB   treatmentC   treatmentD
      4.625        3.000       20.625       30.375
 treatmentE   treatmentF   treatmentG   treatmentH
     58.500       64.375       63.875       85.625


> summary(model)

Call:
lm(formula = decrease ~ treatment, data = OrchardSprays)

Residuals:
    Min      1Q  Median      3Q     Max
-49.000  -9.500  -1.625   3.813  58.750

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.625      7.253   0.638  0.52631
treatmentB     3.000     10.258   0.292  0.77101
treatmentC    20.625     10.258   2.011  0.04918 *
treatmentD    30.375     10.258   2.961  0.00449 **
treatmentE    58.500     10.258   5.703 4.60e-07 ***
treatmentF    64.375     10.258   6.276 5.39e-08 ***
treatmentG    63.875     10.258   6.227 6.48e-08 ***
treatmentH    85.625     10.258   8.347 2.08e-11 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.52 on 56 degrees of freedom
Multiple R-squared: 0.7044, Adjusted R-squared: 0.6674
F-statistic: 19.06 on 7 and 56 DF, p-value: 9.499e-13

Categorical explanatory variables are parameterized differently from
quantitative explanatory variables in R output. For categorical
variables, one of the levels is chosen to be the reference level and
the other categories/levels are parameterized as differences or
contrasts relative to this reference level.

The (Intercept) term is the reference level of the treatment factor
and it corresponds to treatment A. The estimate listed for treatmentB
is the contrast relative to the reference level. Thus the average
decrease for treatment B is 4.625 + 3.000 = 7.625. Likewise the
standard error found on the treatmentB line corresponds to the standard
error of the difference between treatment B and treatment A. The
t-value and p-value printed on each line are the values found by
testing the hypothesis that the parameter is equal to zero. Hence, for
the contrast treatmentC we get a t-value of 2.011 and a corresponding
p-value of 0.04918. We (barely) reject the hypothesis that the
difference between the reference level (treatment A) and treatment C
is zero. The test for the (Intercept) parameter corresponds to the
hypothesis that the overall level of the reference group (treatment A)
equals zero, which we fail to reject with a p-value of 0.52631.
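
The relation between the contrasts and the group averages can be
verified directly. The following sketch compares the raw treatment
means with the means rebuilt from the coefficients; the names
group_means, b and reconstructed are illustrative.

```r
model <- lm(decrease ~ treatment, data = OrchardSprays)

# Raw average decrease for each of the eight treatments
group_means <- tapply(OrchardSprays$decrease,
                      OrchardSprays$treatment, mean)

# Rebuild the means: the intercept is the mean of reference level A,
# and each remaining coefficient is a difference relative to A
b <- coef(model)
reconstructed <- unname(c(b[1], b[1] + b[-1]))

print(rbind(group_means, reconstructed))
```

If a different reference level is wanted, relevel can be applied to the
factor before fitting; the fitted means stay the same and only the
contrasts change.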

The drop1 function tests each explanatory variable by removing the
terms from the model formula and comparing the fit to the original
model. drop1 can be used for model reduction and for one-way analysis
of variance we should set the option test="F" to ensure that drop1
computes test statistics (and corresponding p-values) based on the
F distribution.

> drop1(model, test="F")
Single term deletions

Model:
decrease ~ treatment
          Df Sum of Sq   RSS    AIC F value     Pr(F)
<none>                 23570 394.17
treatment  7     56160 79730 458.16  19.062 9.499e-13 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output it is clear that there is a highly significant effect
of treatment (F statistic of 19.062 and a p-value of 9.499e-13, which
is essentially zero) so we reject the hypothesis that the decrease is
the same for all treatments. Since we only have one explanatory
variable for one-way analysis of variance, we get the exact same result
as we got from the F statistic in the summary output above.
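
That the overall F test is the same regardless of how it is requested
can be confirmed in a few lines. The sketch below extracts the F
statistic from summary, anova and drop1, and also runs oneway.test,
which does not assume equal variances; the object names are
illustrative.

```r
model <- lm(decrease ~ treatment, data = OrchardSprays)

# Three routes to the same overall F statistic
f_summary <- unname(summary(model)$fstatistic["value"])
f_anova   <- anova(model)["treatment", "F value"]
f_drop1   <- drop1(model, test = "F")["treatment", "F value"]
print(c(summary = f_summary, anova = f_anova, drop1 = f_drop1))

# Welch-type test that allows a separate variance in each group
welch <- oneway.test(decrease ~ treatment, data = OrchardSprays)
print(welch)
```

The Welch statistic generally differs from the classical F statistic
because it uses a different variance estimate and degrees of freedom.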

See also: If you only wish to compare the means of two groups, then you
should use the t.test function which does not assume that the variance
of each group is identical (Rule 3.18). For more than two groups
the oneway.test function can test equality of means without assuming
equal variances of each group. See Rule 3.41 for the non-parametric
Kruskal-Wallis test. Rule 4.6 shows how to make boxplots to visually
compare the groups.


3.6 Fit a two-way analysis of variance 

Problem: You wish to perform a two-way analysis of variance to examine
how the means from several groups are influenced by two categorical
explanatory variables.

Solution: Two-way analysis of variance is used to test how two
categorical explanatory variables influence the mean level of the
response variable. Two-way analysis of variance is a special case of a
linear model where there are two categorical explanatory variables so
we can use the lm function for the analysis; we just need to make sure
that the explanatory variables are coded in R as factors. If there are
multiple observations per combination of the two explanatory variables,
then we can include (and test for) an interaction between the two
explanatory variables in the model. Otherwise we can only consider the
additive model, where the two explanatory variables influence the
response independently of each other.

The drop1 function tests each explanatory variable by removing the
term from the model formula while preserving the hierarchical
principle. drop1 can be used for model reduction and the option
test="F" should be used to ensure that drop1 computes test statistics
(and corresponding p-values) based on the F distribution.

In the example below, we use the fev data from the isdals package to
examine if and how gender (0=female, 1=male) and exposure to smoking
(0=no, 1=yes) influence the forced expiratory volume in children.
The interaction.plot function can plot the means for different
combinations of the two explanatory factors, which might illustrate
possible interactions. interaction.plot takes three arguments:
x.factor, which is the explanatory factor plotted along the x axis;
trace.factor, which is the other explanatory variable that determines
the individual traces; and response, which is the response variable.
Figure 3.3 shows the interaction plot for the fev dataset. If there is
no interaction, then the traces should be roughly parallel. The figure
does not suggest a large interaction between the two variables but we
include an interaction between gender and smoking status in our initial
model in order to test if there is an interaction.

Figure 3.3: Interaction plot for the forced expiratory volume data.


> library(isdals)
> data(fev)
> # Convert variables to factors and get meaningful labels
> fev$Gender <- factor(fev$Gender, labels=c("Female", "Male"))
> fev$Smoke <- factor(fev$Smoke, labels=c("No", "Yes"))
> attach(fev)
> interaction.plot(Gender, Smoke, FEV)
> model <- lm(FEV ~ Gender + Smoke + Gender*Smoke, data=fev)
> model

Call:
lm(formula = FEV ~ Gender + Smoke + Gender * Smoke, data = fev)

Coefficients:
        (Intercept)           GenderMale             SmokeYes
             2.3792               0.3552               0.5867
GenderMale:SmokeYes
             0.4221

> drop1(model, test="F")
Single term deletions

Model:
FEV ~ Gender + Smoke + Gender * Smoke
             Df Sum of Sq    RSS     AIC F value   Pr(F)
<none>                    433.40 -261.08
Gender:Smoke  1    2.5127 435.91 -259.30  3.7684 0.05266 .

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results from drop1 show that the interaction is almost significant
(p = 0.05266) when testing at a significance level of 5%. Here we choose
to remove the interaction and refit the model without the interaction
term.
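
The test of the interaction is a comparison of two nested models, and
it can also be written out explicitly with anova. Because the isdals
package may not be installed, the sketch below uses the built-in
warpbreaks data (two factors, wool and tension) as a stand-in; that
substitution and the object names are assumptions made for
illustration only.

```r
# Two-way layout with replicates: interaction model and additive model
full     <- lm(breaks ~ wool * tension, data = warpbreaks)
additive <- lm(breaks ~ wool + tension, data = warpbreaks)

# drop1 respects the hierarchical principle: with an interaction in
# the model, only the interaction term is offered for deletion
d1 <- drop1(full, test = "F")
print(d1)

# The same F test as an explicit comparison of nested models
cmp <- anova(additive, full)
print(cmp)
```

The main effects only appear in the drop1 table once the interaction
has been removed from the model.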

> model2 <- lm(FEV ~ Gender + Smoke, data=fev)
> drop1(model2, test="F")
Single term deletions

Model:
FEV ~ Gender + Smoke
       Df Sum of Sq    RSS     AIC F value     Pr(F)
<none>              435.91 -259.30
Gender  1    25.436 461.35 -224.21  37.986 1.249e-09 ***
Smoke   1    33.681 469.60 -212.63  50.300 3.450e-12 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(model2)

Call:
lm(formula = FEV ~ Gender + Smoke, data = fev)

Residuals:
     Min       1Q   Median       3Q      Max
-1.95758 -0.63880 -0.03123  0.52159  3.03942

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.35788    0.04774  49.394  < 2e-16 ***
GenderMale   0.39571    0.06420   6.163 1.25e-09 ***
SmokeYes     0.76070    0.10726   7.092 3.45e-12 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8183 on 651 degrees of freedom
Multiple R-squared: 0.112, Adjusted R-squared: 0.1093
F-statistic: 41.07 on 2 and 651 DF, p-value: < 2.2e-16

Both of the main effects (gender and smoking exposure) are significant
as shown by drop1, so we conclude that there is an additive effect of
both gender and smoking status on forced expiratory volume. From the
summary output we see that boys on average have a forced expiratory
volume that is 0.3957 liters larger than girls and that persons exposed
to smoking have an expiratory volume that is 0.7607 liters larger than
non-smokers! (This somewhat surprising result is caused by the fact
that smoking status and age are correlated so generally the smokers are
older than the non-smokers.)

See also: Rule 3.7 analyzes the full dataset with all explanatory
variables.


3.7 Fit a linear normal model 

Problem: You wish to analyze a dataset using a linear normal model
where multiple explanatory variables (both categorical and
quantitative) and possibly their interactions may be present.

Solution: The class of linear normal models is extremely versatile and
can be used to analyze data in many situations. Linear normal models
are fitted in R by the lm function, with the appropriate model formula
which can include both quantitative and categorical explanatory
variables as well as interactions between them.

The functions drop1 (with option test="F" to compute the F statistic
and get proper p-values based on the F distribution) and summary
can be used for model reductions and extracting parameter estimates,
respectively.

We use the fev data from the isdals package to illustrate how forced
expiratory volume (a surrogate for lung capacity) is influenced by
gender (0=female, 1=male), exposure to smoking (0=no, 1=yes), age
and height for 654 children aged 3-19 years. The effect of height on
forced expiratory volume is modeled as a quadratic polynomial and
we allow for interactions between both gender and smoking status as
well as between age and smoking status. Thus, we allow the effect of
smoking on lung capacity to be different for boys and girls, and we also
allow for the effect of age to depend on smoking status.

> library(isdals)
> data(fev)
> # Convert variables to factors and get meaningful labels
> fev$Gender <- factor(fev$Gender, labels=c("Female", "Male"))
> fev$Smoke <- factor(fev$Smoke, labels=c("No", "Yes"))
> summary(fev)
      Age              FEV              Ht           Gender    Smoke
 Min.   : 3.000   Min.   :0.791   Min.   :46.00   Female:318   No :589
 1st Qu.: 8.000   1st Qu.:1.981   1st Qu.:57.00   Male  :336   Yes: 65
 Median :10.000   Median :2.547   Median :61.50
 Mean   : 9.931   Mean   :2.637   Mean   :61.14
 3rd Qu.:12.000   3rd Qu.:3.119   3rd Qu.:65.50
 Max.   :19.000   Max.   :5.793   Max.   :74.00

> # Fit the initial model. Interactions like Smoke*Age
> # inherently include the main effects of Smoke and Age
> model <- lm(FEV ~ Ht + I(Ht^2) + Smoke*Gender + Smoke*Age,
+             data=fev)
> drop1(model, test="F")
Single term deletions

Model:
FEV ~ Age + Ht + I(Ht^2) + Smoke * Gender + Smoke * Age
             Df Sum of Sq    RSS     AIC F value     Pr(F)
<none>                    100.50 -1208.9
Ht            1    4.6001 105.10 -1181.6 29.5684 7.663e-08 ***
I(Ht^2)       1    8.7753 109.28 -1156.2 56.4056 1.968e-13 ***
Smoke:Gender  1    0.3906 100.89 -1208.4  2.5108    0.1136
Age:Smoke     1    0.3525 100.85 -1208.6  2.2656    0.1328

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Both the linear and quadratic terms of height are highly significant,
but neither of the two interactions is significant in this initial
model. We choose to remove the interaction between smoke and gender and
refit the reduced model.


> model2 <- lm(FEV ~ Ht + I(Ht^2) + Gender + Smoke*Age,
+              data=fev)
> drop1(model2, test="F")
Single term deletions

Model:
FEV ~ Age + Ht + I(Ht^2) + Gender + Smoke * Age
          Df Sum of Sq    RSS     AIC F value     Pr(F)
<none>                 100.89 -1208.4
Ht         1    4.9069 105.80 -1179.3 31.4668 3.009e-08 ***
I(Ht^2)    1    9.2725 110.16 -1152.9 59.4627 4.720e-14 ***
Gender     1    1.3763 102.27 -1201.5  8.8258   0.00308 **
Age:Smoke  1    0.2557 101.15 -1208.7  1.6399   0.20080

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> # Remove the insignificant interaction and refit
> model3 <- lm(FEV ~ Ht + I(Ht^2) + Gender + Smoke + Age,
+              data=fev)
> drop1(model3, test="F")
Single term deletions

Model:
FEV ~ Ht + I(Ht^2) + Gender + Smoke + Age
        Df Sum of Sq    RSS     AIC F value     Pr(F)
<none>               101.15 -1208.7
Ht       1    4.7552 105.90 -1180.7 30.4644 4.922e-08 ***
I(Ht^2)  1    9.1323 110.28 -1154.2 58.5063 7.354e-14 ***
Gender   1    1.2918 102.44 -1202.4  8.2759  0.004150 **
Smoke    1    0.8493 102.00 -1205.2  5.4411  0.019973 *
Age      1    9.0777 110.22 -1154.5 58.1563 8.658e-14 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since all of the terms in this model are significant, model3 becomes
our final model, and we conclude that there is a significant effect of
height, gender, smoke and age on forced expiratory volume. These
effects work additively as there are no interactions in the final
model. We can get the parameter estimates from the final model using
summary on the final model object.

> summary(model3)

Call:
lm(formula = FEV ~ Ht + I(Ht^2) + Gender + Smoke + Age, data=fev)

Residuals:
      Min        1Q    Median        3Q       Max
-1.611896 -0.227158  0.006186  0.224175  1.805654

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.8945787  1.4993579   4.598 5.12e-06 ***
Ht          -0.2742341  0.0496850  -5.519 4.92e-08 ***
I(Ht^2)      0.0031251  0.0004086   7.649 7.35e-14 ***
GenderMale   0.0945352  0.0328613   2.877  0.00415 **
SmokeYes    -0.1332112  0.0571079  -2.333  0.01997 *
Age          0.0694646  0.0091089   7.626 8.66e-14 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3951 on 648 degrees of freedom
Multiple R-squared: 0.794, Adjusted R-squared: 0.7924
F-statistic: 499.4 on 5 and 648 DF, p-value: < 2.2e-16

The parameter estimates show that, on average, the forced expiratory
volume increases 0.0695 liters per year, boys have a larger lung
capacity than girls (0.0945 liters), and that smoking reduces the lung
capacity by 0.1332 liters. The parabola defined by the effect of height
opens upward and has its minimum at a height of about 43.9, which is
below the smallest height in the data (46), so not surprisingly the
lung capacity increases with height (given the values of height
available in the data frame).
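
Whether the fitted height effect is increasing over the observed range
can be checked from the two height coefficients. The following is a
small worked calculation using the estimates printed in the summary
output; the helper slope_at is a hypothetical name introduced here.

```r
b_ht  <- -0.2742341   # estimate for Ht from the summary output
b_ht2 <-  0.0031251   # estimate for I(Ht^2) from the summary output

# Minimum of the parabola b_ht * Ht + b_ht2 * Ht^2
vertex <- -b_ht / (2 * b_ht2)
print(vertex)

# Derivative of the height effect; positive means FEV rises with height
slope_at <- function(ht) b_ht + 2 * b_ht2 * ht
print(slope_at(c(46, 74)))   # smallest and largest observed heights
```

Both slopes are positive, so within the range of the data the model
predicts a larger forced expiratory volume for taller children.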

Note that while it is possible to define interactions between
quantitative explanatory variables in the model formulas in R, it may
be quite difficult to interpret their meaning. Basically, the values of
the two vectors are just multiplied together element-wise so it is
indeed possible to combine "apples and oranges." In most situations,
however, it makes more sense to model interactions between quantitative
explanatory variables by categorizing the quantitative variables first
and then to include the interaction between the categorized variables
in the model (see Rule 2.24 for how to categorize a quantitative
variable).

See also: Rules 3.23 and 4.18 cover model validation for linear normal 
models. The biglm function in package biglm can be used to fit linear 
models to large datasets with many cases. 


Generalized linear models 


3.8 Fit a logistic regression model 

Problem: You want to fit a logistic regression model to describe the
relationship between a binary response variable and a set of
explanatory variables.

Solution: Logistic regression models are appropriate for binary
outcomes; i.e., when the response variable is categorical with exactly
two categories (typically denoted "success" and "failure"). Logistic
regression models the probability associated with one of the response
categories as a function of one or more explanatory variables.

Logistic regression models are part of the class of generalized linear
models which can be fitted in R using the glm function. glm takes a
model formula as first argument, and needs the family=binomial
argument set to ensure that the logistic regression model is used. The
model formula for logistic regression models can include the response
variable in two different ways: either as a factor or numeric vector with
exactly two categories, or as a matrix with two columns, where the first
column denotes the number of successes and the second column is the
number of failures. We will focus on the first situation here.

If we have a dataset with exactly one observation per row, then we can
specify the logistic regression model formula in glm in the same manner
as for lm. The response variable should then be a factor, and the
first level of this factor will denote failures for the binomial
distribution and all other levels will denote successes. If the
response variable is numeric, it will be converted to a factor, so a
numeric vector of zeros and ones will automatically work as a response
variable where "1" is considered success.
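
The equivalence between a 0/1 numeric response and a two-level factor
can be seen in a small sketch. The built-in mtcars data are used here
purely as a convenient stand-in (am is 0 for automatic and 1 for
manual transmission); the variable gearbox and the model names are
illustrative.

```r
# Numeric 0/1 response: 1 is treated as success
m_num <- glm(am ~ wt, family = binomial, data = mtcars)

# Same response as a factor: the first level is failure,
# the remaining level is success
gearbox <- factor(mtcars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))
m_fac <- glm(gearbox ~ wt, family = binomial, data = mtcars)

print(levels(gearbox))
print(cbind(numeric = coef(m_num), factor = coef(m_fac)))
```

Checking levels() before fitting is a cheap way to make sure that the
intended category is the one being modeled as success.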

The functions summary and drop1 (with argument test="Chisq"
to use the χ²-distribution for likelihood ratio model reductions) can be
used to obtain the parameter estimates and model reduction tests.

Below, we wish to use the birthwt data from the MASS package to
examine if low birth weight is influenced by the mother's race and/or
the mother's smoking status.

> library(MASS)
> data(birthwt)
> model <- glm(low ~ factor(race) + smoke, family=binomial,
+              data=birthwt)
> drop1(model, test="Chisq")
Single term deletions

Model:
low ~ factor(race) + smoke
             Df Deviance    AIC    LRT  Pr(Chi)
<none>            219.97 227.97
factor(race)  2   229.81 233.81 9.8299 0.007336 **
smoke         1   229.66 235.66 9.6869 0.001856 **

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(model)

Call:
glm(formula = low ~ factor(race) + smoke, family = binomial,
    data = birthwt)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3442  -0.8862  -0.5428   1.4964   1.9939

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    -1.8405     0.3529  -5.216 1.83e-07 ***
factor(race)2   1.0841     0.4900   2.212  0.02693 *
factor(race)3   1.1086     0.4003   2.769  0.00562 **
smoke           1.1160     0.3692   3.023  0.00251 **

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 219.97 on 185 degrees of freedom
AIC: 227.97

Number of Fisher Scoring iterations: 4

The results from the drop1 function show that both race and smoking
status are significant (with p-values of 0.007336 and 0.001856,
respectively). The parameter estimates listed with the summary function
show that the effect of smoking status is 1.1160, which is (since glm
uses the logit link function by default) the estimated log odds ratio
of low birth weight for smokers (smoke=1) compared to non-smokers
(smoke=0). Likewise, race 2 has an estimated log odds ratio of 1.0841
for low birth weight compared to the reference race (which is race 1).
The exponential function, exp, can be used to transform the log odds
ratio estimates back to odds ratios. Thus the odds ratio of low birth
weight for smokers compared to non-smokers is exp(1.1160) = 3.0526
with a corresponding 95% confidence interval for the odds ratio given
by

> exp(1.1160 + c(-1,1)*1.96*0.3692) # 95% CI for smoking
[1] 1.480482 6.294222

The last two columns in the output table from summary show the Wald
test statistic and corresponding p-value for testing the hypothesis
that each parameter is equal to zero.
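
The odds ratios and Wald intervals for all parameters can be computed
in one step instead of typing the estimates by hand. The sketch below
assumes the MASS package is available and uses confint.default, which
gives Wald-type intervals (estimate plus or minus a normal quantile
times the standard error); the name or_table is illustrative.

```r
library(MASS)   # provides the birthwt data

model <- glm(low ~ factor(race) + smoke, family = binomial,
             data = birthwt)

# Odds ratios with 95% Wald confidence limits on the odds ratio scale
or_table <- exp(cbind(OR = coef(model), confint.default(model)))
print(round(or_table, 3))
```

The row for smoke should agree with the hand calculation above up to
rounding. The intercept row is an odds, not an odds ratio.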

Note that the manner in which the response variable is coded is
important when interpreting the output. glm models the probability that
successes occur so if we want to model the probability of the failures
we should change the sign of all parameter estimates.
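
The sign change can be demonstrated by swapping the roles of success
and failure in the response. The following sketch uses the built-in
mtcars data as a stand-in example; the model names are illustrative.

```r
# Model the probability that am equals 1 ...
success <- glm(am ~ wt, family = binomial, data = mtcars)

# ... and the probability that am equals 0
failure <- glm(I(1 - am) ~ wt, family = binomial, data = mtcars)

# The estimates are identical apart from the sign
print(cbind(success = coef(success), failure = coef(failure)))
```

The standard errors and p-values are unchanged by the recoding; only
the direction of the effects is reversed.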

By default, glm uses the logit link function for binomial data; i.e., it
models the probability of success through the logarithm of the odds of
success. Another possibility is to use the probit link function which is
useful when it is reasonable to assume that each result of success or
failure is actually the discretely observed outcome of a continuous
underlying normal distribution. The probit link function can replace the
default logit function by adding the link option to the family argument;
i.e., family=binomial(link="probit").

There is only one parameter in the binomial distribution, the
probability of success, so once that parameter is estimated the
variance is fully determined. Sometimes, however, the mean may be
correctly modeled but the observed variance is larger than the expected
variance so there is overdispersion. Overdispersion can be caused by
omission of important explanatory variables, correlation between binary
responses, or a misspecified link function. The quasi-binomial family
can be used with glm to allow for overdispersion, simply by specifying
family=quasibinomial.

In the example below we refit the birth weight data using the probit 
link function and allowing for overdispersion. 

> model2 <- glm(low ~ factor(race) + smoke,
+               family=quasibinomial(link="probit"),
+               data=birthwt)
> summary(model2)

Call:
glm(formula = low ~ factor(race) + smoke,
    family = quasibinomial(link = "probit"),
    data = birthwt)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3395  -0.8884  -0.5265   1.4928   2.0222

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -1.1292     0.2009  -5.620 6.94e-08 ***
factor(race)2   0.6635     0.2959   2.243  0.02611 *
factor(race)3   0.6784     0.2359   2.876  0.00451 **
smoke           0.6842     0.2187   3.128  0.00205 **

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to
be 1.002370)

    Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 219.63 on 185 degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

The overdispersion parameter of 1.002370 is very close to 1 which
suggests no overdispersion in this case. While there is no difference
in the structure of the final model (both race and smoke are still
significant) there is still a discrepancy between the parameter
estimates from the model with the logit link function (model above) and
the model with the probit link function (model2). The discrepancy,
however, reflects the different scales of the two models: the estimates
from the logit link model are interpreted as changes in log odds ratio
while the estimates from the probit model are changes in the z score
for the cumulative standard normal distribution. When the estimates are
transformed to probabilities using the inverse logit or the cumulative
standard normal distribution function, Φ(·), for the logit and probit
link, respectively, then the probabilities of success are generally
very similar.

The probability of low birth weight for an individual who smokes and 
has race 1 is calculated for the two link functions as: 

> exp(-1.8405 + 1.1160) / (1 + exp(-1.8405 + 1.1160))
[1] 0.3264028
> pnorm(-1.1292 + 0.6842)
[1] 0.3281599

Both link functions provide virtually the same result. 
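
The inverse logit does not have to be typed out by hand: the plogis
function in base R computes it directly, just as pnorm does for the
probit link. A minimal sketch using the rounded estimates from the two
summaries (the variable names are illustrative):

```r
# Linear predictors for a smoker of race 1 under each link
eta_logit  <- -1.8405 + 1.1160
eta_probit <- -1.1292 + 0.6842

p_logit  <- plogis(eta_logit)   # inverse logit
p_probit <- pnorm(eta_probit)   # standard normal distribution function

print(c(logit = p_logit, probit = p_probit))
```

For fitted model objects, predict with type="response" and a newdata
data frame returns the same kind of probability without copying
estimates from the printed output.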

See also: The help page for family (see help(family)) lists other
available link functions for binomial models, including the
complementary log-log.


3.9 Fit a multinomial logistic regression model 

Problem: You want to fit a multinomial logit regression model to
describe the relationship between a polytomous categorical response
variable and a set of explanatory variables.

Solution: Logistic regression can be extended to handle response
variables that have more than two categories. Here we will assume that
the response is nominal (consists of unordered categories); see
Rule 3.11 for ordered categories (i.e., when the response categories
are ordinal). We want to model how a set of explanatory variables
influence the probability of observing the different polytomous
outcomes.

Multinomial logistic regression models can be fitted in R using neural
networks which are implemented by the nnet package. The multinom
function handles multinomial logistic regression models and the
function takes a model formula as first argument. The model formula for
multinomial logistic regression can include the response variable in two
different ways: either as a factor with k categories (where the first
category will be the reference level) or as a matrix with k columns
where the elements in the matrix are interpreted as counts for the
different categories. We will focus on the first situation here.

The anova function can be used for model reduction for multinomial 
logistic regression models by comparing the fit of two nested models, 
and summary prints the parameter estimates. 

The alligator data frame from the isdals package contains information
on the length and primary food choice ("fish", "invertebrates",
or "other") of 59 alligators. It is of interest to examine how the
primary food choice depends on the size of the alligators.

> library(nnet)
> library(isdals)
> data(alligator)
> head(alligator)
  length          food
1   1.24 Invertebrates
2   1.32          Fish
3   1.32          Fish
4   1.40          Fish
5   1.42 Invertebrates
6   1.42          Fish
> model <- multinom(food ~ length, data=alligator)
# weights:  9 (4 variable)
initial  value 64.818125
iter  10 value 49.170710
final  value 49.170622
converged
> null <- multinom(food ~ 1, data=alligator)
# weights:  6 (2 variable)
initial  value 64.818125
iter  10 value 57.570928
iter  10 value 57.570928
iter  10 value 57.570928
final  value 57.570928
converged
> anova(null, model) # Compare the two models
Likelihood ratio tests of Multinomial Models

Response: food
   Model Resid. df Resid. Dev   Test    Df LR stat.      Pr(Chi)
1      1       116  115.14186
2 length       114   98.34124 1 vs 2     2 16.80061 0.0002247985
> summary(model)
Call:
multinom(formula = food ~ length, data = alligator)

Coefficients:
              (Intercept)     length
Invertebrates    4.079701 -2.3553303
Other           -1.617713  0.1101012

Std. Errors:
              (Intercept)    length
Invertebrates    1.468640 0.8032870
Other            1.307274 0.5170823

Residual Deviance: 98.34124
AIC: 106.3412

Figure 3.4: Probabilities of primary food preference for alligators of
different lengths. Solid, dashed and dotted lines correspond to "fish",
"invertebrates", and "other", respectively.


The likelihood ratio test statistic for comparing the model where
alligator length is included to the model where alligator length is
excluded yields a p-value of 0.000225 so there is a clear effect of
alligator size on primary food choice.

Presenting the results from a multinomial logistic regression model can
be somewhat tricky since there are multiple equations and multiple
comparisons to present. The estimates printed by the summary function
are log-odds values relative to the reference outcome ("fish" in this
situation), so the estimates refer to the change in the log odds of the
outcome relative to the reference outcome associated with a change in
the explanatory variable.
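
To make the log-odds scale concrete, the sketch below turns the printed
coefficients into odds and probabilities for an alligator of a chosen
length. The length of 2 meters and all object names are assumptions
made for illustration; the coefficients are copied from the summary
output.

```r
# Coefficients from summary(model); "Fish" is the reference outcome
b <- rbind(Invertebrates = c( 4.079701, -2.3553303),
           Other         = c(-1.617713,  0.1101012))
colnames(b) <- c("(Intercept)", "length")

len <- 2   # alligator length in meters (illustrative choice)

# Log odds of each outcome relative to fish; fish itself is 0
eta <- c(Fish = 0, b[, "(Intercept)"] + b[, "length"] * len)

# Odds of invertebrates versus other at this length
odds_inv_vs_other <- exp(eta["Invertebrates"] - eta["Other"])
print(odds_inv_vs_other)

# Probabilities of the three outcomes; they sum to one
probs <- exp(eta) / sum(exp(eta))
print(round(probs, 3))
```

Comparing two non-reference outcomes only requires the difference
between their log odds, because the reference outcome cancels.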

For example, the estimate -2.355 for length of invertebrates says that
the log odds of eating invertebrates relative to eating fish change by
-2.355 for each increase in alligator length of 1 meter. The estimate
is not the overall change in odds of eating invertebrates for two
alligators who differ by a length of 1 meter. Likewise, the odds that
an alligator of, say, length 2 meters prefers "invertebrates" to
"other" are exp((4.080 - 2 · 2.355) - (-1.618 + 2 · 0.110)) = 2.152.

Since we only have a single quantitative explanatory variable in this
model, we can illustrate how the probabilities of the three response
groups change by alligator length. This is done below where predict
(with argument type="probs") is used to estimate the probabilities
of the different responses under the model.

> len <- seq(0.3, 4, .1)
> matplot(len, predict(model, newdata=data.frame(length=len),
+                      type="probs"), type="l",
+         xlab="Alligator length", ylab="Probability")

The result is shown in Figure 3.4 where we see that the probabilities of 
the three responses sum to one for all lengths. 
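
The same probabilities can be computed by hand from the coefficients,
which may help when checking an interpretation. The following is a
minimal sketch that copies the estimates from the output above; the
helper name food_probs is hypothetical and only used for illustration.

```r
# Coefficients copied from the multinom output above
# (reference outcome: "fish").
coefs <- rbind(Invertebrates = c(4.079701, -2.3553303),
               Other         = c(-1.617713,  0.1101012))
colnames(coefs) <- c("(Intercept)", "length")

# Hypothetical helper: outcome probabilities at a given length.
# The reference outcome has linear predictor 0.
food_probs <- function(len) {
  eta <- c(Fish = 0, coefs[, "(Intercept)"] + coefs[, "length"] * len)
  exp(eta) / sum(exp(eta))
}

p2 <- food_probs(2)
print(round(p2, 3))
print(sum(p2))                       # the three probabilities sum to one

# Odds of "invertebrates" versus "other" at length 2 meters
odds_inv_other <- unname(p2["Invertebrates"] / p2["Other"])
print(odds_inv_other)
```

The last value should agree with the 2.152 computed in the text, up to
rounding of the coefficients.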

See also: The vglm function from the VGAM package can fit multinomial
logistic models (with option family=multinomial) using vector
generalized linear models.


3.10 Fit a Poisson regression model 

Problem: You want to fit a Poisson regression model to describe the 
relationship between a count response variable and a set of explanatory 
variables. 

Solution: Poisson regression models can be used to deal with
situations where we wish to model count data; e.g., when the response
variable is obtained by counting the number of occurrences of an event.
Poisson regression models are part of the class of generalized linear
models which can be fitted using the glm function in R if the argument
family=poisson is specified. The first argument to glm is the model
formula, where the variable corresponding to the response should be a
vector of non-negative integers.

The functions summary and drop1 (with argument test="Chisq"
to use the χ²-distribution for likelihood ratio model reductions) can be
used to obtain the parameter estimates and model reduction tests.

The log link is the default link function for Poisson data in glm so it
models the natural logarithm of the expected counts. Two alternate
link functions are available for Poisson models, namely identity and
sqrt, and they are selected by adding the link option to the family
argument, for example family=poisson(link="identity").

Poisson regression models can also be used with rate data, where the
rate is a count of events divided by the corresponding exposure.
Different observational units may have different exposures (e.g., some
counts are registered over time intervals of different lengths, or the
counts could be based on groups of different sizes) so for rate data, an
offset variable is included in the model to represent some measure of
exposure. An offset is a variable that is forced to have a regression
coefficient of 1, and for the log link function the offset contains the
(natural) logarithm of the exposure. The offset is included in the
model by adding an offset(log(exposure)) term to the explanatory
variable(s) in the model formula (where exposure is the name of the
exposure variable).
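
The use of an offset for rate data can be sketched as follows. The data
are simulated purely for illustration, and the variable names (events,
x, exposure, simdata) are assumptions that do not come from any dataset
used in this book.

```r
set.seed(1)

# Simulated rate data: 'events' counted over varying 'exposure'.
n <- 200
exposure <- runif(n, min = 1, max = 10)   # e.g., observation time
x <- rnorm(n)                             # explanatory variable
true_rate <- exp(-1 + 0.5 * x)            # events per unit exposure
events <- rpois(n, lambda = true_rate * exposure)
simdata <- data.frame(events, x, exposure)

# The offset enters the linear predictor with its coefficient fixed
# at 1, so the remaining coefficients describe the log rate.
fit <- glm(events ~ x + offset(log(exposure)),
           family = poisson, data = simdata)
print(coef(fit))              # should be near the simulated -1 and 0.5
print(exp(coef(fit)["x"]))    # rate ratio per unit increase in x
```

Leaving out the offset would instead model the raw counts, so units
with long exposure would appear to have a high event rate.
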

The bees dataset from the kulife package contains information on
the number of bees caught in different-colored cages at four different
locations for two types of bees. It is of interest to see if either of the
two types of bees has a preference for the color of the cages. For brevity
of output, we model the number of caught bees with a main effect of
location, and an interaction between color and bee type although the
data allow for a full three-way interaction.

> library(kulife)
> data(bees)
> beedata <- subset(bees, Time=="july1")   # Look at one day only
> head(beedata)
    Locality Replicate  Color  Time       Type Number id
1  Havreholm         A  White july1 Bumblebees      1  1
4  Havreholm         A Yellow july1 Bumblebees      2  1
7  Havreholm         A   Blue july1 Bumblebees      0  1
10 Havreholm         A  White july1   Solitary      1  1
13 Havreholm         A Yellow july1   Solitary      4  1
16 Havreholm         A   Blue july1   Solitary      3  1

> model <- glm(Number ~ Locality + Type*Color, 

+              family=poisson, data=beedata)
> drop1(model, test="Chisq")
Single term deletions

Model:
Number ~ Locality + Type * Color
           Df Deviance    AIC    LRT   Pr(Chi)
<none>          84.215 195.43
Locality    3  132.051 237.26 47.836 2.308e-10 ***
Type:Color  2   93.956 201.16  9.740  0.007672 **

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(model) 

Call:
glm(formula = Number ~ Locality + Type * Color,
    family = poisson, data = beedata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1173  -0.9826  -0.6551   0.6573   2.3147

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)            -1.0647     0.7147  -1.490 0.136268
LocalityKragevig       -0.4745     0.2407  -1.971 0.048706 *
LocalitySaltrup        -1.8608     0.4063  -4.580 4.66e-06 ***
LocalitySvaerdborg     -1.8608     0.4063  -4.580 4.66e-06 ***
TypeSolit               2.5649     0.7338   3.495 0.000473 ***
ColorWhite              1.3863     0.7906   1.754 0.079509 .
ColorYellow             1.8718     0.7596   2.464 0.013726 *
TypeSolit:ColorWhite   -1.7540     0.8479  -2.069 0.038589 *
TypeSolit:ColorYellow  -2.1342     0.8157  -2.616 0.008888 **

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 162.785  on 71  degrees of freedom
Residual deviance:  84.215  on 63  degrees of freedom
AIC: 195.42

Number of Fisher Scoring iterations: 6

The results from the drop1 function show that there is an interaction
between bee type and cage color (the likelihood ratio test yields a
p-value of 0.007672) and that location is highly significant.

The parameter estimates for a Poisson regression model with log link 
have relative risk interpretations: the estimated effect of an explanatory 
variable is multiplicative on the rate, and thus leads to a risk ratio or 
relative risk. The parameter estimates listed with the summary function 
are relative to the reference group which consists of bumblebees and 
blue cages and the location "Havreholm". For example, the relative 
risk of observing a solitary bee in a yellow cage relative to a bumblebee 
in a blue cage at the same location is 

> exp(2.5649 + 1.8718 - 2.1342) 

[1] 9.99915 

The estimates also show that, apparently, bumblebees prefer yellow 
and white cages to blue, while solitary bees appear to be slightly more 
indifferent to cage color. 
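
The color preferences can be made explicit by combining the main
effects and interaction terms on the log scale before exponentiating.
This is a small sketch using the estimates copied from the summary
output; the object names are only illustrative.

```r
# Estimates copied from the summary output above (log scale).
b <- c(ColorWhite = 1.3863, ColorYellow = 1.8718,
       "TypeSolit:ColorWhite" = -1.7540,
       "TypeSolit:ColorYellow" = -2.1342)

# Relative risk of each color versus blue, within each bee type.
rr_bumble <- exp(b[c("ColorWhite", "ColorYellow")])
rr_solitary <- exp(b[c("ColorWhite", "ColorYellow")] +
                   b[c("TypeSolit:ColorWhite", "TypeSolit:ColorYellow")])

print(round(rbind(Bumblebees = rr_bumble, Solitary = rr_solitary), 2))
```

For bumblebees the relative risks are roughly 4 (white) and 6.5
(yellow), while for solitary bees they are roughly 0.69 and 0.77, i.e.,
much closer to 1.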

The last two columns in the summary output show the Wald test
statistic and corresponding p-value for the hypothesis that each
parameter is equal to zero.

For the Poisson distribution, the variance is fully determined by the
mean so once that parameter is estimated then the variance is given.
Sometimes, however, the data are more variable than the Poisson
distribution predicts and we say that there is overdispersion.
Overdispersion can be caused by omission of important explanatory
variables, correlation between responses, or a misspecified link
function. The quasi-Poisson family can be used with the glm function to
allow for overdispersion, simply by specifying family=quasipoisson.
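
To see what the quasi-Poisson family changes, the following sketch fits
both families to simulated overdispersed counts. The data and object
names are made up for illustration; the dispersion estimate is computed
as the Pearson chi-square statistic divided by the residual degrees of
freedom, which is what summary reports for quasi families.

```r
set.seed(42)

# Simulated overdispersed counts (negative binomial, variance > mean).
x <- rnorm(300)
y <- rnbinom(300, mu = exp(1 + 0.3 * x), size = 2)

pfit <- glm(y ~ x, family = poisson)
qfit <- glm(y ~ x, family = quasipoisson)

# Dispersion estimate: Pearson chi-square divided by residual df.
phi <- sum(residuals(pfit, type = "pearson")^2) / df.residual(pfit)
print(phi)
print(summary(qfit)$dispersion)   # same estimate, as reported by summary

# Standard errors are inflated by sqrt(phi); estimates are unchanged.
se_ratio <- coef(summary(qfit))[, "Std. Error"] /
            coef(summary(pfit))[, "Std. Error"]
print(se_ratio)
print(sqrt(phi))
```

The parameter estimates of the two fits are identical; only the
standard errors (and hence the tests) differ.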

Below we refit the Poisson model for the bees data allowing for
overdispersion.

> model2 <- glm(Number ~ Locality + Type*Color,
+               family=quasipoisson, data=beedata)
> drop1(model2, test="Chisq")
Single term deletions

Model:
Number ~ Locality + Type * Color
           Df Deviance scaled dev.   Pr(Chi)
<none>          84.215
Locality    3  132.051      33.388 2.668e-07 ***
Type:Color  2   93.956       6.798    0.0334 *

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(model2) 

Call:
glm(formula = Number ~ Locality + Type * Color,
    family = quasipoisson, data = beedata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1173  -0.9826  -0.6551   0.6573   2.3147

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            -1.0647     0.8554  -1.245 0.217863
LocalityKragevig       -0.4745     0.2881  -1.647 0.104580
LocalitySaltrup        -1.8608     0.4863  -3.826 0.000302 ***
LocalitySvaerdborg     -1.8608     0.4863  -3.826 0.000302 ***
TypeSolit               2.5649     0.8783   2.920 0.004848 **
ColorWhite              1.3863     0.9463   1.465 0.147898
ColorYellow             1.8718     0.9092   2.059 0.043651 *
TypeSolit:ColorWhite   -1.7540     1.0150  -1.728 0.088859 .
TypeSolit:ColorYellow  -2.1342     0.9764  -2.186 0.032555 *

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.432730)

    Null deviance: 162.785  on 71  degrees of freedom
Residual deviance:  84.215  on 63  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

The overdispersion parameter is 1.433 which suggests some
overdispersion in this situation. While the parameter estimates are
unchanged, we see from the output that the standard errors are larger
now. This is also seen in the test for the hypothesis of no interaction
between cage color and bee type where we now get a p-value of 0.0334
so the interaction is only borderline significant.

We can create confidence intervals for differences between categories
as usual. For example, the 95% confidence interval for the relative risk
between the "Kragevig" location and "Havreholm" is

> exp(-0.4745 + c(-1,1)*1.96*0.2881)  # 95% CI Kragevig/Havreholm
[1] 0.3537460 1.0943669

Thus, we are 95% confident that the interval from 0.35 to 1.09 contains
the true relative risk between the Kragevig and Havreholm locations.


3.11 Fit an ordinal logistic regression model 

Problem: You want to fit an ordinal logistic regression model to
describe the relationship between an ordinal response variable and a set
of explanatory variables.

Solution: Logistic regression models can be extended to handle
response variables with more than two categories as shown in Rule 3.9.
Here we assume that the response categories are ordered (i.e., when the
response categories are ordinal) and we want to model how different
explanatory variables influence the probability of observing the
different polytomous outcomes.

Ordinal data occur when it is possible to order and rank the
observations but when the real distance between the categories is unknown.
Examples of ordered variables include disease severity (least severe
to most severe or disease stages), pain scales, survey responses (from
strongly disagree to strongly agree) and classifications of continuous
variables.

The proportional odds model is an example of a model for ordinal
response data where the explanatory variables have a unified effect on all
response categories. The ordinal nature of the response is preserved
by considering the odds of a response category or a lesser category so
we are essentially looking at cumulative probabilities for the ordered
response categories. The major advantage of the proportional odds
model lies in the interpretation of the effect of explanatory variables:
The change in the cumulative odds for any response category is the
same regardless of which response category we consider.

Several implementations of ordinal regression models exist in R. Here
we will focus on the polr function (found in the MASS package) which
can be used for proportional odds logistic (and probit) regression
modeling. polr expects a model formula with an ordered factor as response
variable. A non-ordered response variable is converted to an ordered
factor where the ordering is given by the ordering of the response factor
levels.

The functions summary and drop1 (with argument test="Chisq"
to use the χ²-distribution for likelihood ratio model reductions) can be
used to obtain the parameter estimates and model reduction tests.

In the example below we use the survey dataset from the MASS package
to model the exercise frequency of statistics students as a function
of their gender, age, and smoking status. We first use the ordered
function on the response variable to ensure that R uses the correct
ordering.

> library(MASS)
> data(survey)
> resp <- ordered(survey$Exer, levels=c("None", "Some", "Freq"))
> head(resp)
[1] Some None None None Some Some
Levels: None < Some < Freq
> model <- polr(resp ~ Sex + Age + Smoke, data=survey)
> drop1(model, test="Chisq")
Single term deletions

Model:
resp ~ Sex + Age + Smoke
       Df    AIC    LRT Pr(Chi)
<none>    451.82
Sex     1 454.37 4.5441 0.03303 *
Age     1 449.97 0.1524 0.69626
Smoke   3 452.98 7.1573 0.06705 .

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model2 <- polr(resp ~ Sex + Smoke, data=survey)
> drop1(model2, test="Chisq")
Single term deletions

Model:
resp ~ Sex + Smoke
       Df    AIC    LRT Pr(Chi)
<none>    449.97
Sex     1 452.52 4.5406 0.03310 *
Smoke   3 451.09 7.1122 0.06841 .

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model3 <- polr(resp ~ Sex, data=survey)
> drop1(model3, test="Chisq")
Single term deletions

Model:
resp ~ Sex
       Df    AIC    LRT Pr(Chi)
<none>    451.09
Sex     1 453.33 4.2392  0.0395 *

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Age and smoking status are both non-significant at the 5% level
(p-values of 0.6963 and 0.06841, respectively). They are both removed
from the model after successive tests, and only gender remains
significant. summary yields the parameter estimates.

> summary(model3) 

Re-fitting to get Hessian

Call:
polr(formula = resp ~ Sex, data = survey)

Coefficients:
         Value Std. Error t value
SexMale 0.4176     0.2517   1.659

Intercepts:
          Value   Std. Error t value
None|Some -1.9935  0.2405    -8.2875
Some|Freq  0.2708  0.1788     1.5148

Residual Deviance: 445.0867
AIC: 451.0867
(1 observation deleted due to missingness)

The estimates in the summary output are split into two sections. The
effects of the explanatory variables are listed in the "Coefficients"
section. For any given exercise group, we have that men are more likely to
have higher categories (i.e., more frequent exercise) than women since
the estimate 0.4176 is positive. Thus, for a given category the odds of
being placed in a higher (more frequent) exercise category increase by
a factor 1.518 for men relative to women (since exp(0.4176) = 1.518313).
The section of results titled "Intercepts" gives the intercepts for all but
the last response category. Thus, the estimated probability that, say, a
female student never exercises is

> exp(-1.9935) / (1 + exp(-1.9935)) 

[1] 0.1198871 

If we wish to calculate the probability that, say, a male student
exercises sometimes then we should keep in mind that we are modeling
cumulative probabilities so we have to subtract the probability of the
first category (no exercise) from the probability of "exercise sometimes
or less" to obtain the probability of "exercise sometimes".

> exp(0.2708 - 0.4176) / (1 + exp(0.2708 - 0.4176)) - 
+ exp(-1.9935 - 0.4176) / (1 + exp(-1.9935 - 0.4176)) 

[1] 0.3810356 

Note that it is not a typo that there is a negative sign before the
coefficient for the explanatory variable sex. Recall that we are
modeling cumulative probabilities, so an increase in cumulative probability
means that it is more likely to observe a lower response category/score.
The change of sign (which normally is positive for the other models we
consider) for the explanatory variable(s) is due to the parameterization
and because we prefer to have positive coefficients indicate an
association with higher categories/scores. Thus, a positive estimate means
that higher response categories are more likely and a negative estimate
means that lower response categories are more likely. In this case the
difference between males and females is positive, so males are more
likely to have more frequent exercise status.
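
The sign convention and the proportional odds property can be checked
numerically. The sketch below copies the estimates from the summary
output and uses plogis (the inverse logit) to obtain the cumulative
probabilities; the object names are chosen for illustration only.

```r
# Estimates copied from summary(model3) above.
zeta <- c("None|Some" = -1.9935, "Some|Freq" = 0.2708)  # intercepts
beta <- 0.4176                                           # SexMale

# Parameterization used in the text: logit P(Y <= j) = zeta_j - beta * x
cum_female <- plogis(zeta)          # x = 0
cum_male   <- plogis(zeta - beta)   # x = 1

# Category probabilities are differences of cumulative probabilities.
probs <- rbind(Female = diff(c(0, cum_female, 1)),
               Male   = diff(c(0, cum_male, 1)))
colnames(probs) <- c("None", "Some", "Freq")
print(round(probs, 4))

# Odds of being ABOVE each cut point, males relative to females.
odds_above <- function(p) (1 - p) / p
print(odds_above(cum_male) / odds_above(cum_female))
print(exp(beta))
```

The odds ratio is the same at both cut points and equals exp(0.4176),
which is exactly the proportional odds assumption.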

The probabilities of the individual categories can also be obtained by
the predict function with option type="prob". We find the same
probabilities as above in the output below.

> head(predict(model3, type="prob"))
        None      Some      Freq
1 0.11988391 0.4474123 0.4327038
2 0.08232958 0.3810487 0.5366217
3 0.08232958 0.3810487 0.5366217
4 0.08232958 0.3810487 0.5366217
5 0.08232958 0.3810487 0.5366217
6 0.11988391 0.4474123 0.4327038

A 95% confidence interval for, say, the odds ratio between the
cumulative probabilities for males and females is

> exp(0.4176 + qnorm(.975)*c(-1,1)*0.2517) 

[1] 0.927073 2.486616 

Hence, we are 95% confident that the odds of scoring higher than a
given category are between 0.93 and 2.49 times as large for a male as
for a female.

By default, polr uses the logit link function; i.e., it models the
cumulative probabilities through the logarithm of the cumulative odds.
Another possibility is to use the probit link function which is useful
when it is reasonable to assume that the ordinal response is actually
the discretely observed outcome of an underlying continuous normally
distributed variable. The probit link function can be used instead of
the default logit function by setting the argument method="probit".
Alternatively, polr allows for complementary log-log or Cauchy latent
variable links with the method="cloglog" or method="cauchit"
arguments, respectively.
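
Switching link function only requires changing the method argument. The
sketch below refits the gender-only model with a probit link; it stores
the ordered response in a copy of the data frame (here called d, an
illustrative name) rather than in a separate vector.

```r
library(MASS)   # provides polr and the survey data

d <- survey
d$resp <- ordered(d$Exer, levels = c("None", "Some", "Freq"))

fit_logit  <- polr(resp ~ Sex, data = d)                     # default
fit_probit <- polr(resp ~ Sex, data = d, method = "probit")

# Coefficients are on different scales (logit vs. probit), so compare
# fitted category probabilities or AIC rather than raw estimates.
print(c(logit = AIC(fit_logit), probit = AIC(fit_probit)))
new <- data.frame(Sex = factor(c("Female", "Male")))
pr <- predict(fit_probit, newdata = new, type = "probs")
print(pr)
```

With a single binary explanatory variable the two links typically give
very similar fitted probabilities.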

One of the assumptions underlying ordinal logistic (and ordinal
probit) regression is that the relationship between each pair of outcome
groups is the same. Thus, the proportional odds assumption should be
checked before the model is used. We check this assumption by first
fitting a multinomial regression model (see Rule 3.9) and then comparing
the deviances of the two models. Under the null hypothesis that the
ordinal regression model is correct, the deviance difference
approximately follows a χ²-distribution with a number of degrees of
freedom corresponding to the difference in number of parameters between
the two models. Deviances are extracted from model fits using the
deviance function, and the effective number of degrees of freedom is
found for both models as the edf element.

> library(nnet) 

> multi <- multinom(Exer ~ Sex + Smoke + Age, data=survey) 

# weights: 21 (12 variable) 

initial value 258.173888 
iter 10 value 215.797234 
final value 215.611365 
converged 

> 1-pchisq(deviance(model) - deviance(multi), 

+ df=multi$edf - model$edf) 

[1] 0.744158 

Hence we fail to reject the proportional odds regression model (p =
0.744158) which means we can use the proportional odds model to
analyze the ordinal response variable.

See also: The vglm function from the VGAM package can fit ordinal
logistic models (with option family=propodds) using vector generalized
linear models.


Methods for analysis of repeated measurements 


3.12 Fit a linear mixed-effects model 

Problem: You want to fit a linear mixed-effects model where some of
the model terms are considered random effects.

Solution: Mixed models are statistical models that include not only
traditional "fixed effects" terms as used in linear and generalized linear
models but also "random effect" terms. Random effects are essentially
random variables and they enter the model differently from the fixed
effects. Random effects are appropriate for representing clustering
(and hence for modeling dependent observations), but they can also be
used to model some effects as random variations around a population
mean.

Interpretation of the random effects is different from the fixed effects. In
particular, the random effects are assumed to be a random sample from
a population so any results can be generalized to the full population
from which the sample was drawn. For a categorical explanatory
variable, the number of parameters in the model increases with the number
of categories when the variable is included as a fixed effect. Random
effect variables do not have this drawback as the number of model
parameters introduced by a random effect is constant.

The lme4 package provides the lmer function which extends the lm
function to fit linear mixed-effects models. The input syntax for lmer is
almost identical to lm except that lmer uses a special notation for the
random effect terms in the model formula. lmer has a REML argument
which is TRUE by default and ensures that lmer computes restricted
maximum-likelihood estimates. If REML=FALSE then maximum-likelihood
estimates are produced.

Random effects are specified in the model formula by the vertical bar
("|") character. The grouping factor for the random effect term
(typically just the name of a variable) is specified at the right of the
vertical bar. The expression to the left of the vertical bar defines the
model for the random effect that is generated for each level of the
grouping factor. A simple random-effects model where there is one random
effect for each grouping level is specified by setting the random effect
term to (1 | variable); i.e., the expression to the left of the vertical
bar corresponds to the normal formula for an intercept, 1.

Several random effect terms can be included simply by adding more
terms to the model. Nested effects are handled by including a random
effects term for all levels of nesting. For example, if you have two
variables, city and country, where city is nested within country then
(1 | city) + (1 | country) should be added to the model. Correlated
random effects can be specified by changing the expression to the
left of the vertical bar. For example, if x is a categorical variable then
the random effects term (1+x | country) allows for a random effect
for each country and allows for a correlation between the random
effects for the levels of x.

In the following we use the ChickWeight data frame to model the
logarithm of chicken weight over time for different diets. There are up to
12 measurements on each chicken from time of birth to 21 days old, and
each chicken has received one of four experimental feeds. We wish to
account for a biological variation between chickens, but we are not
interested in the particular chickens from the sample and only consider
them a random sample from the population of chickens. By including
the term (1 | Chick) we allow for a positive correlation among
measurements taken on the same chicken after the fixed effects have been
taken into account. An interaction between time and diet is included to
allow for different diets to influence the rate of the weight gain. Note
that even though the diets are numbered 1-4, Diet is still coded as a
factor in the data frame.


> data(ChickWeight) 

> library(lme4) 

Loading required package: Matrix
Loading required package: lattice
> head(ChickWeight)
Grouped Data: weight ~ Time | Chick
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

> model <- lmer(log(weight) ~ Time + Diet + Time*Diet + (1|Chick), 

+ data=ChickWeight) 

> summary(model) 

Linear mixed model fit by REML
Formula: log(weight) ~ Time + Diet + Time * Diet + (1 | Chick)
   Data: ChickWeight
    AIC    BIC logLik deviance REMLdev
 -276.0 -232.4  148.0   -355.1  -296.0
Random effects:
 Groups   Name        Variance Std.Dev.
 Chick    (Intercept) 0.025800 0.16062
 Residual             0.025846 0.16077
Number of obs: 578, groups: Chick, 50

Fixed effects:
            Estimate Std. Error t value
(Intercept) 3.768319   0.041194   91.48
Time        0.067537   0.001639   41.21
Diet2       0.048496   0.071072    0.68
Diet3       0.024561   0.071072    0.35
Diet4       0.104324   0.071123    1.47
Time:Diet2  0.008219   0.002716    3.03
Time:Diet3  0.022093   0.002716    8.13
Time:Diet4  0.014737   0.002751    5.36

Correlation of Fixed Effects:
           (Intr) Time   Diet2  Diet3  Diet4  Tm:Dt2 Tm:Dt3
Time       -0.401
Diet2      -0.580  0.232
Diet3      -0.580  0.232  0.336
Diet4      -0.579  0.232  0.336  0.336
Time:Diet2  0.242 -0.603 -0.405 -0.140 -0.140
Time:Diet3  0.242 -0.603 -0.140 -0.405 -0.140  0.364
Time:Diet4  0.239 -0.596 -0.138 -0.138 -0.406  0.359  0.359

The fixed effect estimates are found in the "Fixed effects" section of the
output and are interpreted in the same way as output from the lm
function. In the "Random effects" section, we have parameter estimates
associated with the random effects. In our model, there are two sources of
variation: a biological variation due to chickens, and a within-chicken
variability. The Chick estimate of 0.025800 is the variance between
chickens and Residual (with an estimate of 0.025846) is the variance
within chickens. Here the biological variation between chickens is
about the same size as the variation within each chicken.
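
The two variance components can be summarized by the intraclass
correlation. This is a small sketch using the estimates copied from the
"Random effects" section; the object names are illustrative.

```r
# Variance components copied from the "Random effects" output above.
var_chick    <- 0.025800   # between-chicken variance
var_residual <- 0.025846   # within-chicken (residual) variance

# Intraclass correlation: share of the total variance that is due to
# differences between chickens. Under a random intercept model it is
# also the correlation between two measurements on the same chicken.
icc <- var_chick / (var_chick + var_residual)
print(icc)   # close to 0.5
```

A value near 0.5 says that roughly half of the unexplained variation is
attributable to differences between chickens.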

We are interested in testing if there is any difference in weight gain 
rate for the four diets. That corresponds to testing if the interaction 
between diet and time is significant. The anova function can be used 
for model reductions, where the models to be compared are included 
as arguments. When testing fixed effect parameters, we should set the 
REML=FALSE option to obtain maximum likelihood model fits before 
comparing model likelihoods. 

> model <- lmer(log(weight) ~ Time + Diet + Time*Diet + (1|Chick), 

+ data=ChickWeight, REML=FALSE) 

> noint <- lmer(log(weight) ~ Time + Diet + (1|Chick), 

+ data=ChickWeight, REML=FALSE) 

> anova(model, noint) 

Data: ChickWeight
Models:
noint: log(weight) ~ Time + Diet + (1 | Chick)
model: log(weight) ~ Time + Diet + Time * Diet + (1 | Chick)
      Df     AIC     BIC logLik  Chisq Chi Df Pr(>Chisq)
noint  7 -272.43 -241.91 143.22
model 10 -335.24 -291.64 177.62 68.807      3  7.684e-15 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction between diet and time is clearly significant (p < 0.0001)
so we reject the hypothesis that the weight gain rate is the same
for all four diets. From the estimates shown above, we find that the
weight gain rate is higher for diets 2, 3, and 4 than for diet 1.
Chickens on diet 1 have an estimated rate of exp(0.067537) = 1.0699, so
on average they increase their weight by 6.99% per day. Chickens on
diet 4 increase their weight by a factor exp(0.067537 + 0.014737) =
1.085753, so on average 8.58% per day.
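
The same calculation can be carried out for all four diets at once. The
sketch below copies the fixed effect estimates from the summary output;
the object names are illustrative.

```r
# Fixed-effect estimates copied from the summary output above.
time_slope <- 0.067537                      # diet 1 (reference)
interaction <- c(Diet1 = 0, Diet2 = 0.008219,
                 Diet3 = 0.022093, Diet4 = 0.014737)

# The response is log(weight), so exp(slope) is the multiplicative
# change in weight per day for each diet.
daily_factor <- exp(time_slope + interaction)
print(round(daily_factor, 4))
print(round(100 * (daily_factor - 1), 2))   # percent increase per day
```

Diet 3 has the largest estimated daily growth factor, followed by
diets 4, 2, and 1.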

It is also possible to test if any of the random effect terms are equal
to zero by comparing two models using the anova function. However,
in this situation the p-value reported by anova is twice the correct
size since the hypothesis that the corresponding variance component
is equal to zero is on the boundary of the parameter space. Thus, the
p-values computed by anova should be halved when testing simple
random effects.
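
The halving can be illustrated with a made-up likelihood ratio
statistic. The value 3.2 below is hypothetical and is not taken from any
model in this book.

```r
# Hypothetical likelihood ratio statistic for dropping one simple
# random effect (illustrative value only).
lrt <- 3.2

p_anova <- pchisq(lrt, df = 1, lower.tail = FALSE)  # as reported by anova
p_boundary <- p_anova / 2                           # halved p-value
print(c(anova = p_anova, halved = p_boundary))
```

In this example the unadjusted p-value is above 0.05 while the halved
p-value is below, so the correction can change the conclusion.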

See also: The book by Pinheiro and Bates (2000) gives a thorough
introduction to mixed effects models (and mixed-effects model formulas)
in R.


3.13 Fit a linear mixed-effects model with serial correlation

Problem: You want to fit a linear mixed-effects model which allows for 
both serial correlation and where some terms may be random effects. 

Solution: Mixed models are statistical models that include not only
traditional "fixed effects" as used in linear and generalized linear
models but also "random effect" terms. Here, we will consider linear
mixed effect models that accommodate both nested random effects and
serial correlation.

Random effects induce correlation between observations from a
cluster, for example, when there are repeated measurements on each
individual. However, the correlation induced by the random effect terms
typically assumes that all observations within the cluster are equally
correlated. In some situations (in particular if measurements are
repeated in time or space) it may be more reasonable to assume that
measurements that are close to each other in time/space are potentially
more correlated than observations that are further away in time/space.
Serial correlation models the residual correlation structure for a cluster
and can let the correlation depend on the difference between observed
variables such as for example time or distance.

Rule 3.12 used the lmer function since that easily handles multiple
random effects. However, lmer cannot accommodate serial correlation so
here we use the extremely versatile lme function from the nlme package
instead. lme only easily handles nested random effects, but on the
other hand it can model serial correlation as well as variance
heterogeneity.

Random effects terms are specified differently in lme than in lmer. lme
expects two model formulas: one for the fixed effects terms, which
are specified as usual, and one for the random effect terms through the
random argument. The random effect terms can be specified in various
ways, and here we only consider the simplest situation, where the
random effect has the form ~ 1 | group. This corresponds to a random
intercept model, where group determines the grouping or clustering
structure. The ~ 1 | group random effect term in lme corresponds
to the random effect term (1 | group) in the syntax of lmer.

The serial correlation is specified through the correlation argument.
correlation accepts a correlation structure function, and the nlme
package comes with a number of predefined ones, including for example
corAR1, corExp, corGaus, and corRatio representing first-order
autoregressive, exponential, Gaussian and rational quadratic correlation
structures, respectively. The correlation structures determine the
within-group correlation, and the main argument to the correlation
structure function is form, which expects a one-sided formula that
determines both the clustering of the residual correlation structure
(observations from different clusters are assumed to be independent) and
how the correlation is affected by the "distance" between observations
within the cluster. For example, we should include
form = ~ time | subject as argument to the correlation structure
function to specify that the correlation depends on the (differences
between) time within each subject. If the option nugget is set to TRUE,
then a nugget parameter is included in the model for the correlation
structure.

In the following we use the ChickWeight data frame to model the
logarithm of chicken weight over time for different diets. There are up to
12 measurements on each chicken from birth to 21 days old, and each
chicken has received one of four experimental feeds. An interaction
between time and diet is included to allow for different diets to influence
the rate of the weight gain. A random effect of chicken is included to
account for a biological variation between chickens since we are not
interested in the particular chickens in the data but only consider them
a random sample from the population of chickens. We include a serial
correlation structure (over time for each chicken) to account for the fact
that measurements taken close in time on the same animal are likely to
be more correlated than measurements taken further apart in time on
the same animal. Initially, we assume that the serial correlation
structure can be represented by a Gaussian spatial correlation structure
over time. Thus, mathematically the model is

$$y_{ij} = X_{ij}\beta + b_i + \kappa_{ij} + \epsilon_{ij}$$

where $y_{ij}$ is the $j$th measurement on the $i$th individual, $X_{ij}$ is
the design matrix, $\beta$ is the vector of fixed effect parameters,
$b_i \sim N(0, \nu^2)$ is the random effect for each chicken,
$\kappa \sim N(0, \Psi)$ is the serial correlation, and
$\epsilon \sim N(0, \sigma^2)$ is the individual measurement error. Two
different measurements on the same chicken will have covariance

$$\mathrm{cov}(y_{ij}, y_{ij'}) = \nu^2 + \psi^2 \exp\left(-\frac{(t_{ij} - t_{ij'})^2}{\phi^2}\right)$$

where $\nu^2$ is due to the random effect term and the remaining
correlation is due to the serial correlation. The inclusion of a nugget
effect is the reason the $\epsilon_{ij}$s are included in the model so
the total variance for a single observation becomes

$$\mathrm{var}(y_{ij}) = \nu^2 + \psi^2 + \sigma^2.$$


> library(nlme)
> data(ChickWeight)
> head(ChickWeight)
Grouped Data: weight ~ Time | Chick
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
> model <- lme(log(weight) ~ Diet + Time + Time*Diet,
+              random= ~ 1|Chick,
+              correlation=corGaus(form= ~ Time|Chick,
+                                  nugget=TRUE),
+              data=ChickWeight)
> model


Linear mixed-effects model fit by REML 
Data: ChickWeight 

Log-restricted-likelihood: 717.3723 
Fixed: log(weight) ~ Diet + Time + Diet * Time 
(Intercept) Diet2 Diet3 Diet4 

3.777869868 -0.016298711 -0.016190877 0.005659405 

Time Diet2:Time Diet3:Time Diet4:Time 
0.060117705 0.013134919 0.024021147 0.020730733 


Random effects: 

Formula: ~1 | Chick 

(Intercept) Residual 
StdDev: 0.03121003 0.2517433 


Correlation Structure: Gaussian spatial correlation 
Formula: ~Time | Chick 
Parameter estimate(s): 
range nugget 

11.02202288 0.01393186 

Number of Observations: 578 
Number of Groups: 50 


Model reductions and comparisons of nested models are done using the
anova function, and maximum likelihood estimation should be used
when testing fixed effects instead of the default restricted maximum
likelihood. To get maximum likelihood estimates in lme we need to set
the optional argument method="ML".

> model2 <- lme(log(weight) ~ Diet + Time + Time*Diet,
+               random= ~ 1|Chick,
+               correlation=corGaus(form= ~Time|Chick,
+                                   nugget=TRUE),
+               method="ML", data=ChickWeight)
> model3 <- lme(log(weight) ~ Diet + Time, random= ~ 1|Chick,
+               correlation=corGaus(form= ~Time|Chick,
+                                   nugget=TRUE),
+               method="ML", data=ChickWeight)
> anova(model3, model2)
       Model df       AIC       BIC   logLik   Test  L.Ratio p-value
model2     1 12 -1463.379 -1411.064 743.6894
model3     2  9 -1448.402 -1409.166 733.2012 1 vs 2 20.97638   1e-04

> summary(model)
Linear mixed-effects model fit by REML
 Data: ChickWeight
        AIC       BIC   logLik
  -1410.745 -1358.597 717.3723

Random effects:
 Formula: ~1 | Chick
        (Intercept)  Residual
StdDev:  0.03121003 0.2517433

Correlation Structure: Gaussian spatial correlation
 Formula: ~Time | Chick
 Parameter estimate(s):
      range      nugget
11.02202288  0.01393186

Fixed effects: log(weight) ~ Diet + Time + Diet * Time
                Value  Std.Error  DF  t-value p-value
(Intercept)  3.777870 0.05328731 524 70.89624  0.0000
Diet2       -0.016299 0.09210101  46 -0.17697  0.8603
Diet3       -0.016191 0.09210101  46 -0.17579  0.8612
Diet4        0.005659 0.09211534  46  0.06144  0.9513
Time         0.060118 0.00350522 524 17.15091  0.0000
Diet2:Time   0.013135 0.00592672 524  2.21622  0.0271
Diet3:Time   0.024021 0.00592672 524  4.05302  0.0001
Diet4:Time   0.020731 0.00596089 524  3.47779  0.0005
 Correlation:
           (Intr) Diet2  Diet3  Diet4  Time   Dt2:Tm Dt3:Tm
Diet2      -0.579
Diet3      -0.579  0.335
Diet4      -0.578  0.335  0.335
Time       -0.651  0.377  0.377  0.377
Diet2:Time  0.385 -0.662 -0.223 -0.223 -0.591
Diet3:Time  0.385 -0.223 -0.662 -0.223 -0.591  0.350
Diet4:Time  0.383 -0.221 -0.221 -0.660 -0.588  0.348  0.348

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-3.8888672 -0.1629101  0.2985462  0.8735338  2.8966671

Number of Observations: 578
Number of Groups: 50

We reject the additive model (p = 0.0001), so the effect of time on weight
depends on the type of diet. The estimates of the final model are obtained by
summary from the final fitted object using restricted maximum likelihood. The
weight gain per day is largest for diets 3 and 4 (0.0601 + 0.0240 and
0.0601 + 0.0207, respectively) and smallest for diet 1. Recall that this is on
the log scale, so we use the exponential function to back-transform the
estimates to the original scale; e.g., the weight increases on average by a
factor exp(0.0601 + 0.0240) = 1.088 per day for diet 3.
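
The two numbers quoted above can be checked by hand. The sketch below
recomputes the likelihood ratio statistic from the log-likelihoods printed in
the anova output and back-transforms the REML slope estimates from the summary
output to daily growth factors. All inputs are copied from the printed output;
the variable names are chosen here for illustration only.

```r
# Log-likelihoods and parameter counts copied from the anova output
loglik_interaction <- 743.6894   # model with Time*Diet, 12 parameters
loglik_additive    <- 733.2012   # additive model, 9 parameters

lr_stat <- 2 * (loglik_interaction - loglik_additive)
p_value <- pchisq(lr_stat, df = 12 - 9, lower.tail = FALSE)

# Daily multiplicative growth factor for each diet on the original scale
time_slope  <- 0.060118
interaction <- c(Diet1 = 0, Diet2 = 0.013135, Diet3 = 0.024021, Diet4 = 0.020731)
growth_factor <- exp(time_slope + interaction)

print(round(lr_stat, 4))
print(signif(p_value, 2))
print(round(growth_factor, 3))
```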

The "Random Effects" section of the summary output is split into two parts: one
related to the random effect terms and one related to the serial correlation
structure. The estimates for the standard deviation found for "Intercept" and
"Residual" correspond to estimates of the between-chicken standard deviation
($\nu$) and the combined measurement error and within-chicken standard
deviation ($\sqrt{\sigma^2 + \psi^2}$), respectively. The standard deviation
between chickens (0.03121) is much smaller than the standard deviation within
chickens (0.2517). The serial correlation structure is shown to have a "range"
of 11.0220, which is an estimate of $\phi$ and measures how far the
autocorrelation stretches (presumably, the autocorrelation is essentially zero
beyond the range). Due to the parameterization used in lme, the value of 0.0139
listed under "nugget" is $\sigma^2/(\sigma^2 + \psi^2)$. The nugget represents
extra variability at distances smaller than the typical sample distance.
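
To make the parameterization concrete, the following sketch splits the residual
variance into its measurement error and serial correlation parts and evaluates
the model-implied correlation between two measurements on the same chicken. It
assumes the Gaussian covariance form given earlier in this rule; the estimates
are copied from the summary output, and the helper implied_cor is a
hypothetical name introduced here.

```r
# Estimates copied from the summary output
sd_chick    <- 0.03121003   # between-chicken standard deviation (nu)
sd_residual <- 0.2517433    # sqrt(sigma^2 + psi^2)
range_est   <- 11.02202288  # phi
nugget_est  <- 0.01393186   # sigma^2 / (sigma^2 + psi^2)

sigma2 <- nugget_est * sd_residual^2        # measurement error variance
psi2   <- (1 - nugget_est) * sd_residual^2  # serial correlation variance
nu2    <- sd_chick^2

# Model-implied correlation between two measurements on the same chicken
# taken `lag` days apart (hypothetical helper)
implied_cor <- function(lag) {
  (nu2 + psi2 * exp(-(lag / range_est)^2)) / (nu2 + psi2 + sigma2)
}

lags <- c(2, 6, 12, 20)
print(round(setNames(implied_cor(lags), paste0("lag", lags)), 3))
```

The implied correlation falls steadily with the lag and levels off at the small
contribution from the random chicken effect.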

We can test the hypothesis that the nugget effect is zero. Since the 
nugget effect is part of the variance structure, we compare the model 
without the nugget effect to the model with the nugget effect using re¬ 
stricted maximum likelihood. 

> model4 <- lme(log(weight) ~ Diet + Time + Time*Diet,
+               random= ~ 1|Chick,
+               correlation=corGaus(form= ~ Time|Chick),
+               data=ChickWeight)
> anova(model4, model)
       Model df        AIC       BIC   logLik   Test L.Ratio p-value
model      1 12 -1410.7445 -1358.5969 717.3723
model4     2 11  -963.6855  -915.8835 492.8428 1 vs 2 449.059  <.0001

We reject the hypothesis that the nugget effect is zero since p < 0.0001.

Figure 3.5: The left panel shows the semi-variogram for the empirical serial
correlation structure (points) and model-based correlation structure (lines). The
right panel shows the semi-variogram with the actual and smoothed empirical
correlations after the serial correlation structure of the model has been factored
out.


Different types of correlation structures can be compared using Akaike's
information criterion (AIC), which is extracted from a model fit using
the AIC function. When comparing fitted objects, the smaller the AIC,
the better the fit.

> model5 <- lme(log(weight) ~ Diet + Time + Time*Diet, 

+ random= ~ 1|Chick, 

+ correlation=corExp(form= ~ Time|Chick, 

+ nugget=TRUE), 

+ data=ChickWeight) 

> AIC(model) 

[1] -1410.745 

> AIC(model5) 

[1] -1177.013 

The Gaussian correlation structure provides a substantially better fit to 
the data than the exponential correlation structure. 
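
The AIC values reported by lme can be reproduced from the log-likelihood and
the number of parameters. The short sketch below does this for the two Gaussian
models whose log-likelihoods are printed in the anova output; aic_by_hand is a
hypothetical helper name. Comparing REML-based AIC values like this is only
meaningful because the models share the same fixed effects.

```r
# AIC = -2 * logLik + 2 * (number of parameters)
aic_by_hand <- function(loglik, n_par) -2 * loglik + 2 * n_par

aic_gauss_nugget    <- aic_by_hand(717.3723, 12)  # Gaussian, with nugget
aic_gauss_no_nugget <- aic_by_hand(492.8428, 11)  # Gaussian, without nugget

print(c(with_nugget = aic_gauss_nugget, without_nugget = aic_gauss_no_nugget))
```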

The adequacy of the serial correlation structure can be assessed graphically
using a semi-variogram. The semi-variogram plots both the empirical serial
correlations and the serial correlations from the model. The Variogram function
computes a semi-variogram which can be plotted directly. By default, Pearson
residuals are computed (an example of which is shown in the left plot of
Figure 3.5), but "normalized" residuals (where the serial correlation structure
from the model has been factored out) can be computed with the resType="n"
argument (right panel of Figure 3.5).




> plot(Variogram(model)) 

> plot(Variogram(model, resType="n")) 

Figure 3.5 shows that the empirical and model-based semi-variograms 
match nicely until a distance around 12. The same trend is shown for 
the normalized residuals, where the model-based correlation structure 
has been factored out. Ideally, the right panel of Figure 3.5 should be 
a flat line to show that the normalized residuals are constant, but there 
seems to be a remaining increase in variance for large distances. 

See also: The book by Pinheiro and Bates (2000) gives a thorough introduction
to mixed effects models (and mixed-effects model formulas) in R.


3.14 Fit a generalized linear mixed model 

Problem: You want to fit a generalized linear mixed-effects model
where some of the model terms are considered random effects.

Solution: The class of generalized linear mixed models extends linear
mixed-effects models in the same way as generalized linear models
extend linear models.

The glmer function from the lme4 package implements generalized
linear mixed-effects models in R. glmer extends the lmer function
described in Rule 3.12 in the same way as glm extends lm to generalized
linear models. The family option sets the error distribution for
glmer, and the same families as for glm can be applied except for the
quasi-binomial and quasi-Poisson families, which are used to account
for overdispersion.

The summary and anova functions should be used to extract information on
parameter estimates and for model reductions, respectively. By default, glmer
uses maximum likelihood to find the parameter estimates except if the family is
Gaussian, in which case restricted maximum likelihood is used. The nAGQ
argument sets the number of points used for the adaptive Gauss-Hermite
approximation of the likelihood. It takes a positive integer, and increasing it
improves the approximation at the cost of speed.

In this example we use the cbpp data frame from the lme4 package to illustrate
a generalized linear logistic regression mixed-effects model. cbpp contains
information on the incidence of the contagious bovine pleuropneumonia (cbpp)
disease from 15 different commercial herds in Africa. Blood samples were
collected quarterly to determine cbpp status, and we would like to include a
random effect of herd to account for any correlation between incidence
measurements taken on the same herd over time.


> library(lme4)
> data(cbpp)
> model <- glmer(cbind(incidence, size-incidence) ~
+                period + (1|herd), family=binomial, data=cbpp)
> summary(model)
Generalized linear mixed model fit by the Laplace approximation
Formula: cbind(incidence, size - incidence) ~ period + (1 | herd)
   Data: cbpp
   AIC   BIC logLik deviance
 110.1 120.2 -50.05    100.1
Random effects:
 Groups Name        Variance Std.Dev.
 herd   (Intercept) 0.4125   0.64226
Number of obs: 56, groups: herd, 15

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.3985     0.2279  -6.137 8.42e-10 ***
period2      -0.9923     0.3054  -3.249 0.001156 **
period3      -1.1287     0.3260  -3.462 0.000537 ***
period4      -1.5804     0.4288  -3.686 0.000228 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
        (Intr) perid2 perid3
period2 -0.351
period3 -0.329  0.267
period4 -0.249  0.202  0.186

> model2 <- glmer(cbind(incidence, size-incidence) ~ 1 + 

+ (1|herd), family=binomial, data=cbpp) 

> anova(model2, model) 

Data: cbpp 

Models: 

model2: cbind(incidence, size - incidence) ~ 1 + (1 | herd) 
model: cbind(incidence, size - incidence) ~ period + (1 | herd) 

Df AIC BIC logLik Chisq Chi Df Pr(>Chisq) 
model2 2 129.71 133.76 -62.853 

model 5 110.10 120.22 -50.048 25.61 3 1.151e-05 *** 

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see from the anova output that the incidence rates at the four periods
are significantly different (p < 0.0001). Also, since the contrasts for periods
2-4 relative to period 1 are all negative and significantly different from
zero, it looks as if period 1 has the largest incidence of cbpp, after which it
decreases. The odds ratio for contagious bovine pleuropneumonia when comparing
period 2 to period 1 is


> exp(-0.9923) 

[1] 0.3707230 

so the average odds of cbpp is a factor 2.697 (= 1/0.3707) smaller for
period 2 compared to period 1. The herd estimate found in the "Random effects"
section is 0.4125, which is the (Gaussian) variance of the herd random effects
on the logit scale. Based on this random effects estimate, the odds ratio is
exp(2 · 0.64226) = 3.6129 for moderately high incidence herds (one standard
deviation above the average) compared to moderately low incidence herds (one
standard deviation below the average).
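
The back-transformations used in this interpretation can be collected in a few
lines of base R. The sketch below takes the logit-scale estimates printed in
the summary output and converts them to odds ratios and to incidence
probabilities for a herd whose random effect is zero; the variable names are
chosen here for illustration only.

```r
# Logit-scale estimates copied from the glmer summary output
fixed <- c(intercept = -1.3985, period2 = -0.9923,
           period3 = -1.1287, period4 = -1.5804)
herd_sd <- 0.64226

# Odds ratios relative to period 1
odds_ratio <- exp(fixed[-1])

# Incidence probability per period for a herd with random effect zero
prob_typical <- plogis(fixed[["intercept"]] + c(period1 = 0, fixed[-1]))

# Odds ratio between herds one SD above and one SD below the average
herd_or <- exp(2 * herd_sd)

print(round(odds_ratio, 4))
print(round(prob_typical, 3))
print(round(herd_or, 4))
```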

As mentioned above, glmer cannot use the quasi-binomial and quasi-Poisson
families to accommodate overdispersion. Instead we can include a random effect
for each observation to introduce extra, individual, variation to each
observation. We do that in the code below.

> cbpp$obs <- 1:nrow(cbpp)
> model3 <- glmer(cbind(incidence, size - incidence) ~ period +
+                 (1|herd) + (1|obs), family=binomial, data=cbpp)
Number of levels of a grouping factor for the random effects
is *equal* to n, the number of observations
> summary(model3)
Generalized linear mixed model fit by the Laplace approximation
Formula: cbind(incidence, size - incidence) ~ period +
    (1 | herd) + (1 | obs)
   Data: cbpp
   AIC   BIC logLik deviance
 102.7 114.8 -45.34    90.68
Random effects:
 Groups Name        Variance Std.Dev.
 obs    (Intercept) 0.794023 0.89108
 herd   (Intercept) 0.033835 0.18394
Number of obs: 56, groups: obs, 56; herd, 15

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.5003     0.2888  -5.196 2.04e-07 ***
period2      -1.2265     0.4735  -2.591  0.00958 **
period3      -1.3288     0.4884  -2.721  0.00651 **
period4      -1.8663     0.5906  -3.160  0.00158 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
        (Intr) perid2 perid3
period2 -0.594
period3 -0.576  0.353
period4 -0.476  0.292  0.282

The fixed effects estimates change slightly when overdispersion is introduced,
but the largest change is for the random effects. The majority of the variation
from the random effects is due to overdispersion and not because of differences
among the herds.

See also: Rules 3.8 and 3.10 give examples of generalized linear models, and
Rule 3.12 gives an introduction to linear mixed-effects models and how to
include random effect terms in the model formula. The glmm function from the
repeated package can also fit generalized linear mixed models with a random
intercept.


3.15 Fit a generalized estimating equation model 

Problem: You want to use generalized estimating equations to estimate the
parameters in a model for clustered/dependent observations, where a
correlation structure may be present but where the underlying model for the
correlation structure is unspecified.

Solution: Correlated or dependent data are common in many experiments.
Sometimes it is possible to correctly model the origin of the dependency among
observations that gives rise to the correlation structure. In other
situations, however, it may be assumed that some observations are correlated,
for example due to clustering, but the exact correlation structure may be
unknown. Thus, additional assumptions may be needed to formulate a full
likelihood for the data, and even then the likelihood may prove to be
intractable.

Generalized estimating equations (GEE) are a convenient and general approach to
the analysis of, for example, generalized linear models when correlation is
present but where the underlying process that generates the correlation is
unspecified. The main advantage of the generalized estimating approach is that
it produces consistent (asymptotically unbiased) estimates of the fixed-effect
parameters even if the correlation structure is misspecified.

Generalized estimating equations can be solved with the gee function found in
the gee package. The gee function accepts a model formula as its first
argument, as well as the arguments id and corstr, which are used to define the
clustering and the correlation structure within clusters, respectively. The id
argument should be a vector or a factor of the same length as the number of
observations that defines the individual clusters. Note that the input data
frame should be sorted according to this clustering variable so observations
within a cluster are in contiguous rows.

The simplest form of correlation structure is the independence model, which is
also the default correlation structure for gee. It assumes no correlation
within clusters, and an identity matrix is used to represent the correlation
structure within a cluster. Other correlation structures include
corstr="exchangeable" for the exchangeable correlation structure (i.e., when
observations within a cluster are assumed to have a common correlation), a
completely unstructured correlation structure ("unstructured"), auto-regressive
banded ("AR-M") and a fixed correlation structure. The fixed correlation
structure, corstr="fixed", allows the user to specify the correlation structure
directly by providing a square matrix of dimensions at least the size of the
largest cluster through the argument R. Smaller clusters will use the
appropriate upper-left square matrix of R to determine the correlation
structure. For the banded correlation structures, the Mv option accepts a
positive integer and sets the width of the band.
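
To see what these working correlation structures look like, the sketch below
builds the independence, exchangeable, and first-order autoregressive patterns
for a cluster of four observations. This is plain base R meant to illustrate
the shapes of the matrices, not the internal code of the gee package; the
cluster size and the value of rho are hypothetical.

```r
n   <- 4     # cluster size (hypothetical)
rho <- 0.3   # hypothetical correlation parameter

# Independence: identity matrix
independence <- diag(n)

# Exchangeable: one common correlation for every pair
exchangeable <- matrix(rho, n, n)
diag(exchangeable) <- 1

# First-order autoregressive pattern: correlation rho^|i - j|
ar1 <- rho^abs(outer(seq_len(n), seq_len(n), "-"))

print(exchangeable)
print(round(ar1, 3))
```

A matrix such as exchangeable or ar1 has the form that the R argument expects
when corstr="fixed" is used.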

The family argument defines the error distribution family just as for
the glm function. The default distribution is gaussian, but gee also
accepts, for example, binomial, poisson, as well as their quasi
versions.

As an example we use the same data as in Rule 3.14 and analyze the cbpp data
from the lme4 package using generalized estimating equations. We wish to use a
logistic regression model to model the incidence of the contagious bovine
pleuropneumonia (cbpp) disease from 15 different commercial herds in Africa.
Blood samples were collected quarterly to determine cbpp status, and we would
like to account for possible dependency of observations due to incidence
measurements taken on the same herd over time. Thus, herd should be considered
a cluster in the analysis, and we use the exchangeable correlation structure.
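
Because gee relies on clusters occupying contiguous rows, it can be useful to
check this before fitting. The sketch below uses a small made-up data frame and
a hypothetical helper, is_contiguous, to show that sorting by the cluster
variable with order produces the required layout.

```r
# Hypothetical unsorted data: cluster ids are interleaved
d <- data.frame(id = c(2, 1, 3, 1, 2, 3, 1),
                y  = c(5, 3, 8, 4, 6, 7, 2))

# TRUE when every cluster occupies one contiguous block of rows
is_contiguous <- function(id) {
  length(rle(as.character(id))$values) == length(unique(id))
}

before   <- is_contiguous(d$id)
d_sorted <- d[order(d$id), ]
after    <- is_contiguous(d_sorted$id)
print(c(before = before, after = after))
```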

> library(gee) 

> library(lme4) 

> data(cbpp) 

> cbpp <- cbpp[order(cbpp$herd),] # Order data according to herd 

> model <- gee(cbind(incidence, size-incidence) ~ period, 

+ corstr="exchangeable", family=binomial, 

+ id=herd, data=cbpp) 

Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27 
running glm to get initial regression estimate 
(Intercept) period2 period3 period4 

-1.269023 -1.170763 -1.301405 -1.782279 

> summary(model) 

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998)

Model:
 Link:                      Logit
 Variance to Mean Relation: Binomial
 Correlation Structure:     Exchangeable

Call:
gee(formula = cbind(incidence, size - incidence) ~ period,
    id = herd, corstr = "exchangeable", data = cbpp,
    family = binomial)

Summary of Residuals:
        Min          1Q      Median          3Q         Max
-0.21925413 -0.05239724  0.91914467  1.93468856 11.78074587

Coefficients:
             Estimate Naive S.E.   Naive z Robust S.E.  Robust z
(Intercept) -1.270018  0.2139396 -5.936341   0.2753930 -4.611657
period2     -1.160764  0.4242832 -2.735823   0.4370163 -2.656111
period3     -1.289817  0.4553427 -2.832629   0.4924785 -2.619031
period4     -1.763369  0.6004957 -2.936522   0.3793846 -4.647973

Estimated Scale Parameter:  2.178349
Number of Iterations:  2

Working Correlation
           [,1]       [,2]       [,3]       [,4]
[1,] 1.00000000 0.02743670 0.02743670 0.02743670
[2,] 0.02743670 1.00000000 0.02743670 0.02743670
[3,] 0.02743670 0.02743670 1.00000000 0.02743670
[4,] 0.02743670 0.02743670 0.02743670 1.00000000


The estimates are obtained through the summary function and have the same order
of magnitude as we saw in the generalized linear mixed model analysis in
Rule 3.14. The output from summary also gives both naive and robust standard
errors for the parameter estimates. Generally, the robust standard errors are
preferred, but here we see that only for period 4 is there any real difference
between the robust and the naive standard errors.

There are no built-in functions to test parameters, so we should rely
on the robust z values, which are approximately normally distributed.
To test several parameters simultaneously, for example in the case of a
categorical explanatory variable with more than two categories, we can
use a generalized Wald test. The generalized Wald test is not part of the
gee package, so we have to compute it ourselves.

For example, to test the hypothesis of no period effect, we should test whether
the three parameters corresponding to periods 2, 3, and 4 are equal to zero.
From the order of the coefficients in the output above, we see that periods 2-4
correspond to parameters 2-4 since the intercept is the first parameter. To
compute the generalized Wald test statistic we need to extract the relevant
parameter estimates and invert the relevant part of the robust
variance-covariance matrix of the estimates. This is done using the coef
function to extract the fixed effects coefficients and the robust.variance
element of the fitted object to extract the robust variance.

> prs <- c(2,3,4) # Choose parameters to test equal to zero 

> VI <- solve(model$robust.variance[prs,prs]) # Invert variance 

> res <- as.numeric(coef(model)[prs] %*% VI %*% coef(model)[prs]) 

> 1-pchisq(res, df=length(prs)) 

[1] 2.066544e-06 

We reject the hypothesis that the three contrasts are equal to zero (p < 
0.0001) and conclude that period is highly significant. 
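
The computation above can be wrapped in a small reusable function. The sketch
below is one possible way to do so; wald_test is a hypothetical helper, and the
coefficient vector and covariance matrix are made-up numbers used only to show
the call.

```r
# Generalized Wald test of H0: beta[idx] == 0
# beta: coefficient vector, V: (robust) covariance matrix of beta
wald_test <- function(beta, V, idx) {
  b <- beta[idx]
  stat <- as.numeric(t(b) %*% solve(V[idx, idx, drop = FALSE]) %*% b)
  c(statistic = stat, df = length(idx),
    p.value = pchisq(stat, df = length(idx), lower.tail = FALSE))
}

# Hypothetical coefficients and covariance matrix, for illustration only
beta <- c(-1.2, -1.1, -1.3, -1.7)
V    <- diag(c(0.08, 0.20, 0.25, 0.15))
res  <- wald_test(beta, V, idx = 2:4)
print(res)
```

For a fitted gee object the same helper could be called as
wald_test(coef(model), model$robust.variance, 2:4).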

Even though the generalized estimating equation approach provides
robust estimates even for misspecifications of the correlation structure,
we can try to use an unstructured correlation structure to see how
different the estimated correlation becomes.

> model2 <- gee(cbind(incidence, size-incidence) ~ period, 

+ corstr="unstructured", family=binomial, 

+ id=herd, data=cbpp) 

Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27 
running glm to get initial regression estimate 
(Intercept) period2 period3 period4 

-1.269023 -1.170763 -1.301405 -1.782279 

> summary(model2)$coefficients 



             Estimate Naive S.E.   Naive z Robust S.E.  Robust z
(Intercept) -1.256728  0.2117538 -5.934856   0.2810060 -4.472247
period2     -1.300016  0.4781961 -2.718584   0.4826147 -2.693694
period3     -1.234020  0.4243769 -2.907839   0.4681485 -2.635958
period4     -1.765320  0.5805868 -3.040579   0.3768504 -4.684406
> model2$working.correlation
            [,1]       [,2]       [,3]        [,4]
[1,]  1.00000000 -0.2191938  0.1198793  0.08915527
[2,] -0.21919377  1.0000000  0.5251137 -0.11135990
[3,]  0.11987928  0.5251137  1.0000000 -0.25550218
[4,]  0.08915527 -0.1113599 -0.2555022  1.00000000

We see that the unstructured correlation structure gives both negative and
positive correlations, and that the sizes of the correlations are somewhat
larger than what was found with the exchangeable correlation matrix. However,
the robust standard errors of the parameter estimates are not vastly different.

See also: The geepack package provides another implementation of
generalized estimating equations. Generalized estimating equations
for ordinal regression are implemented by the ogee function from the
repolr package.


3.16 Decompose a time series into a trend, seasonal, and 
residual components 

Problem: You have a time series and want to decompose the time 
series into a trend, seasonal, and residual component. 

Solution: Time series analysis is concerned with analysis of an ordered
sequence of values of a variable typically measured at equally spaced time
intervals. Frequently sampled time series data are common in many fields, for
example in stock market quotes, daily precipitation or temperature, quarterly
sales revenue, annual measurements of water levels, etc.

One of the common goals of time series analysis is to decompose the series into
three components: trend, seasonality, and residual/irregular. The trend
component is the long-term change in the series, while seasonality is a
systematic, short-term, calendar-related effect. The residual component is what
remains after the seasonal and trend components of a time series are removed,
and it results from short-term fluctuations.

R has a special data type for use with time series analysis, and the ts
function creates time series objects from vectors, arrays or data frames. ts
takes a number of arguments besides the actual time series values. For a
periodic/seasonal time series the frequency argument should be set to an
integer greater than 1. frequency states how many observations are measured per
natural time unit, and its default value is 1, so there is one measurement per
natural time unit. For example, with monthly measurements we set frequency=12,
and for quarterly measurements we can use frequency=4. The optional start and
end arguments set the time of the first and last observation, respectively.
Either option should be a single number or a vector of two integers, which
specify a natural time unit and the number of samples into that time unit.
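
The following sketch shows how frequency, start, and end fit together. It
creates a made-up monthly series of 24 values starting in January 2001 and uses
the base R functions start, end, cycle, and window to inspect it; the data
values themselves are arbitrary.

```r
# Hypothetical monthly series: 24 values starting January 2001
x <- ts(1:24, start = c(2001, 1), frequency = 12)

print(start(x))        # (natural time unit, sample within unit) of first value
print(end(x))          # same for the last value
print(cycle(x)[1:13])  # position of each observation within its year

# window() extracts a sub-period using the same (unit, sample) convention
q2_2002 <- window(x, start = c(2002, 4), end = c(2002, 6))
print(q2_2002)
```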

The stl function makes a seasonal decomposition of a time series into trend,
seasonal, and residual components using local regression. It needs a periodic
time series object (i.e., the frequency should be greater than 1) as first
argument. The only other required argument to stl is s.window, which should
either be the character string "periodic" or an odd number giving the span in
lags used by the local regression window for seasonal extraction. It is
possible to change the degree and window size for the seasonality, the trend
and each subseries through the s.degree, t.degree, l.degree, and t.window,
l.window options, respectively. The stl function assumes an additive seasonal
component, but we can log-transform the response to use a multiplicative
seasonal component for stl.

We will illustrate the stl decomposition using the monthly totals of
international airline passengers (in thousands) from 1949 to 1960 found
in the AirPassengers time series object. A total of 144 observations are
available.


> data(AirPassengers) 

> model <- stl(log(AirPassengers), s.window="periodic") 

> plot(model) 

> summary(model) 

 Call:
 stl(x = log(AirPassengers), s.window = "periodic")

 Time.series components:
    seasonal                  trend
 Min.   :-2.135277e-01   Min.   :4.829389
 1st Qu.:-9.388938e-02   1st Qu.:5.204604
 Median :-1.452619e-02   Median :5.550305
 Mean   : 1.220580e-09   Mean   :5.542506
 3rd Qu.: 7.805111e-02   3rd Qu.:5.914812
 Max.   : 2.164004e-01   Max.   :6.204752
   remainder
 Min.   :-0.1050193083
 1st Qu.:-0.0182802028
 Median :-0.0008257820
 Mean   :-0.0003302444
 3rd Qu.: 0.0179438190
 Max.   : 0.0864920261
 IQR:
     STL.seasonal STL.trend STL.remainder data
     0.17194      0.71021   0.03622       0.69453
   %  24.8        102.3       5.2         100.0

 Weights: all == 1

 Other components: List of 5
 $ win  : Named num [1:3] 1441 19 13
 $ deg  : Named int [1:3] 0 1 1
 $ jump : Named num [1:3] 145 2 2
 $ inner: int 2
 $ outer: int 0

The four graphs in the left panel of Figure 3.6 show the original time series
as well as the seasonal, trend, and residual components obtained from stl. The
relative sizes of the vertical bars in the plot suggest that the trend
dominates the data series and that the effect of the residual component is
relatively small. The same information is conveyed in the summary output, which
shows the inter-quartile range of the individual components relative to the
inter-quartile range of the original data.

Figure 3.6: Decomposition of a time series using the stl function (left panel).
The grey vertical bars along the right-hand side give a relative scale of
comparison to determine the relative importance of the different components.
The effect of components with smaller bars dominates the time series. The right
panel shows the autocorrelation function for the residual component.
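
The additive structure of the decomposition and the relative inter-quartile
ranges can be verified directly from the fitted object. The sketch below refits
the same stl model and checks that the three components add up to the
log-transformed series, which corresponds to a multiplicative decomposition on
the original scale. The object names fit, parts, and iqr_pct are chosen here
for illustration.

```r
fit   <- stl(log(AirPassengers), s.window = "periodic")
parts <- fit$time.series   # columns: seasonal, trend, remainder

# The three components add up to the (log-transformed) series
recon_error <- max(abs(rowSums(parts) - as.numeric(log(AirPassengers))))
print(recon_error)

# IQR of each component relative to the IQR of the data, in percent
iqr_pct <- 100 * apply(parts, 2, IQR) / IQR(log(AirPassengers))
print(round(iqr_pct, 1))

# On the original scale the decomposition is multiplicative
multiplicative <- exp(parts[, "seasonal"]) * exp(parts[, "trend"]) *
  exp(parts[, "remainder"])
```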

The seasonal, trend and residual values are stored as three time series
in the time.series element of the result from stl. We can plot the
autocorrelation function of the residual term using the acf function as
shown below. The output is shown in the right plot of Figure 3.6.


> acf(model$time.series[,3]) # Plot autocorrelation of residuals 


Based on the plot of the autocorrelation function there still seems to 
be some periodic variation left in the residual component of the time 
series. 

See also: decompose makes a seasonal decomposition of a time series
using moving averages. The zoo package can handle irregularly
spaced time series.


3.17 Analyze time series using an ARMA model

Problem: You have a time series and want to analyze it using an
auto-regressive moving average model.

Solution: Time series analysis is concerned with analysis of an ordered
sequence of values of a variable typically measured at equally
spaced time intervals. Frequently sampled time series data are common
in many fields, for example in stock market quotes, daily precipitation
or temperature, annual measurements of water levels, etc.

Rule 3.16 presents a way to decompose a periodic time series into trend, 
seasonality and remainder components to describe the nature of the 
phenomenon represented by the observations. Here we will look at a 
more formal statistical model for predicting future values of the time 
series (also known as forecasting). 

The auto-regressive moving average (ARMA) model can handle both serial
correlation in the response variable (i.e., an autoregressive process) and
serial correlation in the error term (i.e., a moving average process). The
ARMA models are part of the larger class of auto-regressive integrated moving
average (ARIMA) models, where the time sequence may be differenced several
times before analysis. Models can be summarized as ARIMA(p, d, q), where the
three parameters are non-negative integers that correspond to the number of
autoregressive parameters, the order of differencing needed to stationarize the
time series, and the number of moving average parameters, respectively.

Random-walk and random-trend models, autoregressive models, and exponential
smoothing models (i.e., exponential weighted moving averages) are all special
cases of ARIMA models. Here, however, we will assume that overall trends are
removed from the time series such that it can be assumed to be stationary
(i.e., d = 0) and that any seasonality has been removed.

The arima function can fit auto-regressive integrated moving average
models with Gaussian errors. The function expects a univariate time
series as first argument, and the orders of the non-seasonal part of
the model are set by the order argument, which expects a vector with
three integers corresponding to the p, d, and q parameters described
above. arima can also model the periodicity of the time series
separately through the seasonal option, which accepts a list with
components order and period to describe the model within each natural
time unit. Other options include xreg, which should be a vector or
matrix containing external explanatory variables with the same number
of rows as the number of observations, and include.mean, which
is TRUE by default and means that an intercept is included in the
ARMA model. The time series should be stored in R's special data type
for time series, which can be created by ts as described
in Rule 3.16.
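
To make the order argument concrete, the following is a minimal sketch
on simulated data rather than on one of the book's datasets. The seed,
the series length and the coefficient values (0.7 and -0.4) are
assumptions chosen purely for illustration; arima.sim from the stats
package generates the ARMA(1,1) series.

```r
# Simulate an ARMA(1,1) series with mean 10 (illustrative values only)
set.seed(1)
sim <- arima.sim(model = list(ar = 0.7, ma = -0.4), n = 500) + 10

# order = c(p, d, q): one AR term, no differencing, one MA term
fit <- arima(sim, order = c(1, 0, 1))

# The coefficients are named ar1, ma1 and intercept
est <- coef(fit)
print(est)
```

The estimates should be close to the values used in the simulation,
which is a quick way to check that the order vector was specified as
intended.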

The plot function plots the time series, acf will estimate (and by
default also plot) the autocorrelation function, while pacf estimates (and
plots) the partial autocorrelation function. They all accept a time series
object as input.

The greenland dataset from the kulife package contains the
average yearly summer air temperatures from Tasiilaq, Greenland from
1960 to 2010. We only have one measurement per year so there is no
obvious seasonality.

> library(kulife) 

> data(greenland) 

> temp <- ts(greenland$airtemp, start=1960) 

> temp 

Time Series: 

Start = 1960 
End = 2010 
Frequency = 1 

[1] 6.2 6.6 5.8 5.9 5.9 5.9 6.1 4.9 6.0 5.5 4.7 NA NA 5.1 6.0 

[16] 5.8 5.7 6.3 6.5 5.2 5.4 5.3 5.1 4.0 5.1 5.6 5.0 4.8 5.6 4.9 

[31] 6.1 6.1 4.6 4.9 6.2 5.8 5.6 5.7 5.6 5.9 5.8 6.1 6.3 7.6 6.4 

[46] 6.9 6.1 7.0 6.6 6.3 7.8 

> acf(temp, na.action=na.pass) 

> pacf(temp, na.action=na.pass) 


The temperature time series is shown in the left plot of Figure 3.7, while
the estimated autocorrelation and partial autocorrelation functions are
shown in the right plot of Figure 3.7. The autocorrelation function
shows almost exponential decay towards zero, indicating that an
autoregressive model may be appropriate, possibly with a first-order
moving average term to accommodate the non-monotone decay. The partial
autocorrelation function quickly becomes small, so apart from the peak
at lag 3 we would suggest a first-order autoregressive model to fit the
data. Thus we fit an ARIMA(1,0,1) model to the data.

> model <- arima(temp, order=c(1,0,1)) 

> model 

Call: 

arima(x = temp, order = c(1, 0, 1))

Coefficients:

         ar1      ma1  intercept
      0.9262  -0.6162     6.0256
s.e.  0.0738   0.1398     0.4065

sigma^2 estimated as 0.3616: log likelihood = -45.1, aic = 98.2

Figure 3.7: Time series with average summer air temperature from Greenland
from 1960-2010 (left plot). The dashed line shows the future prediction of
temperature based on an ARMA model. The right plot shows the estimated
autocorrelation and partial autocorrelation functions from the original
time series.

R reports the estimated mean of the series under the name intercept, so
the fitted model states that the observation at time t satisfies

$$ y_t - 6.0256 = 0.9262 (y_{t-1} - 6.0256) + e_t - 0.6162 e_{t-1}, $$

where $y_{t-1}$ is the previous value of the time series, and $e_t$ and
$e_{t-1}$ are independent Gaussian errors for the measurements at time
t and t - 1, respectively.

The residuals and autocorrelation function of the residuals can be shown
with the tsdiag function, and the predict function can be used to
predict future values of the time series. Both functions take an arima
fit as input, and the n.ahead argument to predict sets the number
of steps ahead to predict. The tsdiag output shows no indication of
an inappropriate model fit (output not shown). Below we predict the
air temperature for the next 15 years, and plot the original time series
as well as the prediction (with 95% confidence intervals). The ts.plot
function plots multiple time series in the same graph, so we just use
the original series and prediction as input arguments. Also, note how
the lines function automatically places the confidence bands at the
correct time interval. The graph is shown in the left plot of Figure 3.7.

> forecast <- predict(model, 15) 

> forecast 
$pred

Time Series:

Start = 2011 
End = 2025 
Frequency = 1 

[1] 6.817040 6.758633 6.704537 6.654433 6.608026 6.565045 

[7] 6.525236 6.488364 6.454214 6.422584 6.393288 6.366154 

[13] 6.341023 6.317747 6.296188 

$se 

Time Series: 

Start = 2011 
End = 2025 
Frequency = 1 

[1] 0.6013092 0.6295481 0.6528001 0.6721059 0.6882360 0.7017777 

[7] 0.7131895 0.7228355 0.7310089 0.7379483 0.7438496 0.7488749 

[13] 0.7531592 0.7568151 0.7599373 

> ts.plot(temp, forecast$pred, lty=1:2)

> lines(forecast$pred + 1.96*forecast$se, lty=3) 

> lines(forecast$pred - 1.96*forecast$se, lty=3) 
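
The same forecasting steps can be tried without the kulife package. The
sketch below uses the LakeHuron series that ships with R; the choice of
dataset, the ARMA(1,1) order and the 10-year horizon are assumptions made
only to keep the example self-contained.

```r
# LakeHuron ships with R: annual lake levels for 1875-1972
fit <- arima(LakeHuron, order = c(1, 0, 1))

# predict returns a list with components pred and se
fc <- predict(fit, n.ahead = 10)

# Approximate 95% limits from the forecast standard errors
upper <- fc$pred + 1.96 * fc$se
lower <- fc$pred - 1.96 * fc$se

# Original series (solid), forecast (dashed) and limits (dotted)
ts.plot(LakeHuron, fc$pred, lty = 1:2)
lines(upper, lty = 3)
lines(lower, lty = 3)
```

Because pred and se are themselves time series objects starting in the
year after the last observation, the limits line up with the forecast
without any manual handling of the time axis.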

Note that R fits a model without intercept when differences are
computed as part of the arima call; i.e., when d > 0. If there is drift, then
we want to include an intercept when modeling the differences since
the intercept will capture constant drift. The xreg option can be used
to include such a term in models with differencing.
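
One common way to do this is to pass a linear time index through xreg,
since a linear trend in the original series corresponds to a constant in
the differenced series. The sketch below uses a simulated random walk
with drift; the seed, the drift of 0.3 and the column name drift are
assumptions for illustration.

```r
set.seed(2)
n <- 200
y <- ts(cumsum(0.3 + rnorm(n)))   # simulated random walk with drift 0.3

# The coefficient of the time index estimates the drift, i.e., the
# intercept of the differenced series
fit <- arima(y, order = c(0, 1, 0), xreg = cbind(drift = seq_len(n)))
drift_hat <- unname(coef(fit))
print(drift_hat)
```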

See also: The book by Shumway and Stoffer (2010) gives a good
introduction to time series modeling in R. The package its implements
irregular time series.


Specific methods 


3.18 Compare populations using t test 

Problem: You wish to compare two population means or test a single 
population mean using a t test. 

Solution: The t test tests the hypothesis that two population means
are identical or that a single population mean has a hypothesized value.
The t.test function computes one- and two-sample t tests and it
accepts input as either a model formula or as one or two vectors of
quantitative observations.

For the two-sample test, we can input the two vectors of observations
as the first two arguments. The optional argument alternative
specifies the alternative hypothesis and can be one of two.sided (the
default) for a two-sided test, or less or greater if a one-sided test is
desired. The var.equal argument is by default FALSE which produces
Welch's two-sample t test where the variances in the two populations
are not assumed to be equal. If the paired=TRUE argument is set, then
a paired t test is computed.
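
The vector interface and the alternative and paired arguments can be
sketched with the sleep data that ships with R. The dataset is an
assumption chosen for illustration (it is not used elsewhere in this
rule); it records the extra hours of sleep for the same ten subjects
under two drugs, so a paired analysis is natural.

```r
data(sleep)
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

# Paired t test: the two vectors hold measurements on the same subjects
paired <- t.test(g1, g2, paired = TRUE)

# One-sided Welch test: is the mean of g1 less than the mean of g2?
onesided <- t.test(g1, g2, alternative = "less")

print(paired)
print(onesided)
```

Both calls return an object whose components (statistic, parameter,
p.value, conf.int) can be extracted for further use.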

The cats data frame from the MASS package contains information on 
the body weight (in kg) and heart weight (in grams) for 144 adult cats. 
We wish to examine if the body weight is the same for male and female 
cats. 

> library(MASS) 

> data(cats) 

> head(cats) 

Sex Bwt Hwt 

1 F 2.0 7.0 

2 F 2.0 7.4 

3 F 2.0 9.5 

4 F 2.1 7.2 

5 F 2.1 7.3 

6 F 2.1 7.6 

> t.test(Bwt ~ Sex, data=cats) 

Welch Two Sample t-test 


data: Bwt by Sex 

t = -8.7095, df = 136.838, p-value = 8.831e-15 
alternative hypothesis: true difference in means is 
not equal to 0 

95 percent confidence interval: 

-0.6631268 -0.4177242 
sample estimates: 
mean in group F mean in group M 
2.359574 2.900000 

The mean body weights for the two groups are found to be 2.3596 and
2.900 kg for the female and male cats, respectively. The p-value is
virtually zero so we reject the hypothesis that the mean body weight
for males and females is identical, and a 95% confidence interval for
the difference in mean body weight is [-0.6631; -0.4177].
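
The Welch statistic and its fractional degrees of freedom can be
reproduced by hand, which shows where a value such as df = 136.838 comes
from. The sketch below applies the Welch-Satterthwaite approximation to
the cats data; the variable names are hypothetical.

```r
library(MASS)   # provides the cats data

f <- cats$Bwt[cats$Sex == "F"]
m <- cats$Bwt[cats$Sex == "M"]

# Squared standard errors of the two group means
se2_f <- var(f) / length(f)
se2_m <- var(m) / length(m)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(f) - mean(m)) / sqrt(se2_f + se2_m)
df <- (se2_f + se2_m)^2 /
  (se2_f^2 / (length(f) - 1) + se2_m^2 / (length(m) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)
print(c(t = t_stat, df = df, p = p_value))
```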

If we assume equal variances in both populations, we set the argument
var.equal=TRUE.

> t.test(Bwt ~ Sex, data=cats, var.equal=TRUE) 

Two Sample t-test

data: Bwt by Sex

t = -7.3307, df = 142, p-value = 1.590e-11
alternative hypothesis: true difference in means is 
not equal to 0 

95 percent confidence interval: 

-0.6861584 -0.3946927 
sample estimates: 
mean in group F mean in group M 
2.359574 2.900000 

The p-value becomes slightly larger in this situation but we still clearly 
reject the null hypothesis of equal body weights for male and female 
cats. 

To test the mean of a single population, we supply t.test with a single
numeric vector as input and set the mu argument to the hypothesized
mean value (default value is zero).

> x <- c(1.3, 5.5, -2.1, 0.9, -0.4, 1.1)

> t.test(x, mu=0) 

One Sample t-test 


data: x 

t = 1.018, df = 5, p-value = 0.3554 
alternative hypothesis: true mean is 
not equal to 0 

95 percent confidence interval: 

-1.601357 3.701357 

sample estimates: 
mean of x 
1.05 

We fail to reject the null hypothesis that the population mean is zero
(p = 0.3554). The 95% confidence interval for the population mean is
given by [-1.601; 3.701], and the mean of the sample is 1.05.

See also: See Rule 3.39 for a non-parametric version of the one-sample 
t test and Rule 3.40 for a non-parametric two-sample test. 


3.19 Fit a nonlinear model 

Problem: You wish to fit a nonlinear model to your data and use least 
squares to estimate the corresponding parameters. 

Solution: Nonlinear models are an extension of linear models where
the response variable is modeled as a nonlinear function of the model
parameters but where the residual errors are still assumed to be
Gaussian distributed. Nonlinear models are often derived on the basis of
physical and/or biological considerations, e.g., from differential
equations, where there is some knowledge of the data-generating process.
In particular, the parameters of a nonlinear model usually have direct
interpretation in terms of the process under study.

Least-squares estimates of the parameters from a nonlinear model can
be obtained with the nls function. nls allows for more complex model
formulas than we usually find for model formulas with linear
predictors since nls treats the right-hand side of the formula argument as a
standard algebraic expression rather than as a linear-model formula.
The first argument to nls should be an algebraic formula that expresses
the relationship between the response variable and the explanatory
variables and parameters. The start option is a named list of parameters
and corresponding starting estimates. The names found in the start
list also determine which terms in the algebraic formula are considered
parameters to be estimated.
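
How the names in start separate parameters from data can be seen in a
small sketch on simulated values. Everything here (the seed, the true
parameter values, the variable names cycle and fluo) is an assumption
for illustration and not taken from a real experiment.

```r
set.seed(3)
cycle <- 1:45
# Simulated logistic curve: c = 30, b = 0.8, fmax = 100, fb = 4
fluo <- 100 / (1 + exp(-(cycle - 30) / 0.8)) + 4 + rnorm(45, sd = 1)

# cycle and fluo are found as data; c, b, fmax and fb appear in
# start and are therefore treated as parameters to be estimated
fit <- nls(fluo ~ fmax / (1 + exp(-(cycle - c) / b)) + fb,
           start = list(c = 29, b = 1, fmax = 100, fb = 0))
est <- coef(fit)
print(est)
```

The estimates are returned in the order given in the start list.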

As an example, consider the following four-parameter extension to the
logistic growth model which is sometimes used to fit fluorescence
readings from real-time polymerase chain reaction (PCR) experiments. If we
let t indicate the cycle number, then the reaction fluorescence at cycle t
is defined as

$$ F(t) = \frac{F_{\max}}{1 + \exp\left(-\frac{t - c}{b}\right)} + F_b, $$

where $F_{\max}$ is the maximal reaction fluorescence, c is the fractional
cycle at which reaction fluorescence reaches half of $F_{\max}$, b is
related to the slope of the curve and $F_b$ is the background reaction
fluorescence.

The qpcr data frame from the kulife package contains information
on fluorescence levels at different cycles from a real-time PCR
experiment. We choose sensible starting values based on prior knowledge or
from the experiment. Reasonable values for the maximum, $F_{\max}$, the
background, $F_b$, and c are easily determined in this case.


> library(kulife) 

> data(qpcr) 

> # Use data from just one of the 14 runs 

> runl <- subset(qpcr, transcript==1 & line=="wt")

> model <- nls(flour ~ fmax/(1+exp(-(cycle-c)/b))+fb,

+ start=list(c=25, b=1, fmax=100, fb=0),

+ data=runl) 

> model 

Nonlinear regression model 

model: flour ~ fmax/(1 + exp(-(cycle - c)/b)) + fb 

data: runl 

c b fmax fb 

29.5932 0.8406 96.7559 3.7226

residual sum-of-squares: 53.79

Number of iterations to convergence: 9 
Achieved convergence tolerance: 7.738e-06 

> summary(model) 

Formula: flour ~ fmax/(1 + exp(-(cycle - c)/b)) + fb

Parameters:
     Estimate Std. Error t value Pr(>|t|)
c    29.59321    0.02842 1041.30   <2e-16 ***
b     0.84059    0.02473   33.98   <2e-16 ***
fmax 96.75594    0.39650  244.02   <2e-16 ***
fb    3.72259    0.22467   16.57   <2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.145 on 41 degrees of freedom

Number of iterations to convergence: 9
Achieved convergence tolerance: 7.738e-06



> # Plot the original data 

> plot(runl$cycle, runl$flour, 

+ xlab="Cycle", ylab="Fluorescence") 

> lines(runl$cycle, predict(model)) # Add the fitted line 

The least squares estimates for the four parameters are seen in the
output with c = 29.59, b = 0.8406, $F_{\max}$ = 96.756 and $F_b$ = 3.723.
Figure 3.8 shows a plot of the observed data and the predicted curve, and
we observe a nice fit of the model to the data. As usual, summary
provides information on the individual parameters and their standard
errors and lists the residual standard error.

The nls function can be quite sensitive to starting values, and even
though the function will run without specifying values for the
parameters with the start option, it is not uncommon to run into
convergence problems. The control argument sets control settings for nls,
and commonly used control settings are maxiter, which changes the
maximum number of iterations allowed, and tol, which sets the
tolerance level. The control parameters are easily changed using the
nls.control function since it returns a list with the components
that the control argument in nls accepts.

The algorithm option to nls specifies the algorithm to use for the
least squares estimation. The default algorithm is Gauss-Newton, but
other options are port for the nl2sol algorithm and plinear for the
Golub-Pereyra algorithm for partially linear least-squares models.

Constraints on the parameter values can be built into the nonlinear least
squares fit through the lower and upper arguments. If either of them
is specified it should be a vector with the same length as the start
list; the two vectors represent the lower and upper bounds, respectively,
of the parameters. Bounds can only be used with the port algorithm.

Figure 3.8: Fit of a four-parameter non-linear regression model to real-time PCR
data.

For example, to fit the same nonlinear model but with a lower bound
for b at 0.9, with the maximum number of iterations set to 100, and
relative tolerance to $10^{-8}$, we use the following code.

> model2 <- nls(flour ~ fmax/(1+exp(-(cycle-c)/b))+ fb, 

+ start=list(c=25, b=1, fmax=100, fb=0),

+ lower=c(-Inf, .9, -Inf, -Inf), algorithm="port",

+ control=nls.control(maxiter=100, tol=1e-08),

+ data=runl)

> model2

Nonlinear regression model 

model: flour ~ fmax/(1 + exp(-(cycle - c)/b)) + fb 

data: runl 

c b fmax fb 

29.589 0.900 97.051 3.609 

residual sum-of-squares: 60.98 

Algorithm "port", convergence message: 

both X-convergence and relative convergence (5) 

The parameter estimates have changed a little since the b parameter
reached its lower boundary. The restriction on the b parameter has
also caused the residual sum-of-squares to increase from 53.79 to 60.98.

See also: The nls package provides additional functions for
nonlinear regression. The nls2 package adds a brute-force algorithm to nls
and allows for multiple starting values. The selfStart function can
be used to create a self-starting nonlinear model fit, where the starting
values do not have to be specified by the user.
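
R already ships with a few self-starting models, and SSfpl (the
four-parameter logistic) has the same shape as the curve used in this
rule, only parameterized differently: A is the lower asymptote, B the
upper asymptote, xmid the midpoint and scal the scale. The sketch below
fits it to simulated data; the seed, the simulated values and the
variable names are assumptions for illustration.

```r
set.seed(3)
cycle <- 1:45
# Simulated curve with background 4, maximum 100, midpoint 30, slope 0.8
fluo <- 100 / (1 + exp(-(cycle - 30) / 0.8)) + 4 + rnorm(45, sd = 1)

# No start list is needed: SSfpl computes its own starting values
fit <- nls(fluo ~ SSfpl(cycle, A, B, xmid, scal))
est <- coef(fit)
print(est)

# Translate to the parameterization used in this rule
c(c = est[["xmid"]], b = est[["scal"]],
  fmax = est[["B"]] - est[["A"]], fb = est[["A"]])
```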


3.20 Fit a Tobit regression model 

Problem: You want to fit a Tobit regression model to a dataset because 
some of the observed responses are left-censored. 

Solution: The classical Tobit regression model (also called the
censored regression model) estimates a linear regression model for a
left-censored quantitative response variable. Left-censored data are
common in situations where data can only be observed under certain
conditions (e.g., when observations can only be measured when they are
above a detection threshold or when negative values are not possible).

Tobit regression can be fitted directly using the survreg function from
the survival package, but the tobit function from the AER package
provides a wrapper function to survreg as well as some additional
functionality including easy handling of both left and right censoring.

The tobit function takes a regular model formula as first argument,
and the left and right arguments set the left and right limit for
censoring of the response variable. By default, the left value is set to zero
and the right value to infinity to indicate that there is no right
censoring and that values are left-censored at zero. Censoring of the response
variable is done automatically based on the left and right options
so we do not have to modify the data frame directly.
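
For readers who want to see the direct survreg route, the sketch below
fits a Tobit model to simulated data that are left-censored at zero. The
seed, the true coefficients (1 and 2) and the error standard deviation
(1.5) are assumptions for illustration.

```r
library(survival)

set.seed(4)
n <- 500
x <- rnorm(n)
ystar <- 1 + 2 * x + rnorm(n, sd = 1.5)   # latent, uncensored response
y <- pmax(ystar, 0)                       # observed: left-censored at 0

# The status indicator is TRUE for uncensored observations
fit <- survreg(Surv(y, y > 0, type = "left") ~ x, dist = "gaussian")

print(coef(fit))    # intercept and slope of the latent regression
print(fit$scale)    # estimated standard deviation of the errors
```

The slope and scale estimates should be close to the simulated values of
2 and 1.5, whereas an ordinary lm fit to y would be biased towards zero
by the censoring.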

The Affairs data from the AER package shows infidelity statistics 
from 601 individuals, where the affairs response variable contains 
information on the number of extramarital sexual intercourses during 
the previous year. The majority of observations are zeros since 451 out 
of the 601 observations have had no extramarital intercourse. We would 
like to examine how the variables gender, age and children affect 
the number of affairs. 

> library(AER) 

> data(Affairs) 

> model <- tobit(affairs ~ age + gender + children, data=Affairs) 

> summary(model) 

Call:

tobit(formula = affairs ~ age + gender + children, data = Affairs)

Observations:
         Total  Left-censored     Uncensored Right-censored
           601            451            150              0

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.25012    2.02135  -5.071 3.96e-07 ***
age           0.03566    0.05847   0.610  0.54194
gendermale    0.62601    0.99912   0.627  0.53095
childrenyes   3.46515    1.28159   2.704  0.00686 **
Log(scale)    2.22858    0.06822  32.669  < 2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Scale: 9.287
Gaussian distribution

Number of Newton-Raphson Iterations: 3
Log-likelihood: -738.9 on 5 Df

Wald-statistic: 10.98 on 3 Df, p-value: 0.011840

summary provides a table of left-, right-, and uncensored observations
and we note that 451 of our observations are censored at zero since a
negative number of affairs is not possible. The estimates for the
explanatory variables are interpreted as in a linear model for the
underlying (uncensored) response, i.e., the estimate of gendermale
suggests that on average males have 0.62601 more affairs than females
on that scale, but the number for males is not significantly
larger than for females (since the p-value is 0.53095). The Log(scale)
line shows the logarithm of the estimated residual standard deviation,
which corresponds to the value found for Scale. Hence, the residual
standard deviation is exp(2.22858) = 9.287.

The drop1 function tests each explanatory variable by removing the
terms from the model formula and comparing the fit to the original
model. drop1 can be used for model reduction for tobit regression
by setting the option test="Chisq" to ensure that drop1 computes
likelihood ratio test statistics and corresponding p-values based on the
$\chi^2$ distribution.

> drop1(model, test="Chisq")

Single term deletions 


Model: 

Surv(ifelse(affairs<=0, 0, affairs), affairs>0, type="left") ~ 
age + gender + children 

Df AIC LRT Pr(Chi) 

<none> 1487.7 

age 1 1486.1 0.3700 0.543014 

gender 1 1486.1 0.3938 0.530328 

children 1 1493.4 7.6828 0.005575 ** 

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The children explanatory variable is statistically significant with a
likelihood ratio test statistic of 7.6828 and p-value of 0.005575. Neither
age nor gender is significant when removed individually from the model.
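
The numbers in the deletion table follow from simple arithmetic on the
reported quantities. The sketch below recomputes the p-value for
children and the AIC of the full model from the values printed above;
the variable names are hypothetical.

```r
# Likelihood ratio statistic for removing children (1 degree of freedom)
lrt <- 7.6828
p_value <- pchisq(lrt, df = 1, lower.tail = FALSE)
print(p_value)

# AIC of the full model from its log-likelihood and number of parameters
loglik <- -738.9
n_par <- 5
aic_full <- -2 * loglik + 2 * n_par
print(aic_full)
```

The AIC computed this way differs from the tabulated 1487.7 only because
the log-likelihood is printed with one decimal.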

See also: The zelig function from the Zelig package also implements
tobit regression models.


Model validation 


3.21 Test for normality of a single sample 

Problem: You wish to test if a sample of residuals or observations 
follows a normal distribution. 

Solution: Many statistical models require the residuals (or sometimes
the observations themselves) to follow a normal distribution for the
resulting p-values and confidence intervals to be correct.

Normality of a sample can be assessed using graphical methods but
normality can also be tested statistically. Here, we will cover four
omnibus tests for normality of a sample: Shapiro-Wilk's test,
Lilliefors-Kolmogorov-Smirnov's test, Cramer von Mises test, and the
Anderson-Darling test. The four tests consider different departures from
normality and collectively they provide a good overall examination of
different aspects of normality.

Shapiro-Wilk's test is computed by the shapiro.test function while
the Anderson-Darling, Cramer von Mises, and
Lilliefors-Kolmogorov-Smirnov tests are found in the nortest package as
functions ad.test, cvm.test, and lillie.test. All four functions accept a
single vector of observations as input.

Below we compute the tests for two different datasets — one from a 
normal distribution and one from a right-skewed distribution. 

> library(nortest) 

> x <- rnorm(100, mean=5, sd=2) # Normal distribution 

> y <- (rnorm(100, mean=5, sd=1.5))**2 # Right-skewed distrib. 

> shapiro.test(x) 

Shapiro-Wilk normality test 


data: x 

W = 0.9956, p-value = 0.9876

> ad.test(x)

Anderson-Darling normality test 

data: x 

A = 0.1602, p-value = 0.947 

> cvm.test(y) 

Cramer-von Mises normality test 

data: y 

W = 0.1885, p-value = 0.007331 

> lillie.test(y) 

Lilliefors (Kolmogorov-Smirnov) normality test 


data: y 

D = 0.0894, p-value = 0.04725 

The Shapiro-Wilk and Anderson-Darling tests give very high p-values
(0.9876 and 0.947) so there is no compelling evidence for non-normality
for the normally distributed data. For the non-normal data, both the
Cramer von Mises and Lilliefors-Kolmogorov-Smirnov tests reject the
hypothesis of normality.

Note that the ks.test function from base R cannot directly be used
to compute the Kolmogorov-Smirnov test since it does not account for
the fact that the parameters used with the theoretical distribution are
estimated from the data.
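
A small simulation illustrates the problem. The sketch below draws
normal samples, plugs the estimated mean and standard deviation into
ks.test, and records how often the test rejects at the 5% level; the
seed, the sample size of 50 and the 2000 replications are assumptions
for illustration.

```r
set.seed(5)

# p-values from ks.test when the normal parameters are estimated
# from the very sample being tested
pvals <- replicate(2000, {
  x <- rnorm(50)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})

# Proportion of rejections at the 5% level for truly normal data
reject_rate <- mean(pvals < 0.05)
print(reject_rate)
```

A correctly calibrated test would reject about 5% of the time. The
naive use of ks.test rejects far less often, so it is much too
conservative and has little power to detect non-normality.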

See also: Rules 4.5, 4.17, and 4.18 show examples of ways to examine 
the distribution of a sample graphically. 


3.22 Test for variance homogeneity across groups 

Problem: You wish to test the assumption of variance homogeneity for
an analysis of variance model.

Solution: One of the assumptions of analysis of variance is that the
variances of the observations in the individual groups are equal.
Homogeneity of variances can be checked graphically as described in Rule
4.18, but it can also be tested statistically using for example Bartlett's
test, Levene's test, or Fligner-Killeen's test. The tests are also
meaningful in their own right if the interest is in knowing whether
population group variances are different.

Bartlett's test tests the null hypothesis that the variances of the response
variable are the same for the different groups by comparing the logarithm
of the pooled variance estimate with the weighted sum of the logarithms
of the variances of the individual groups.
The bartlett.test function computes Bartlett's test in R. The input
to bartlett.test should be either a numeric vector of response
values and a numeric or factor vector that defines the groups or a regular
model formula where the right-hand side of the model formula defines
the groups.
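
To see what the test statistic measures, it can be computed by hand from
the group sizes and group variances. The sketch below does this for the
ToothGrowth data grouped by dose and compares the result with
bartlett.test; the variable names are hypothetical.

```r
data(ToothGrowth)
g <- factor(ToothGrowth$dose)

ni <- tapply(ToothGrowth$len, g, length)   # group sizes
vi <- tapply(ToothGrowth$len, g, var)      # group variances
N <- sum(ni)
k <- length(ni)

# Pooled variance estimate
sp2 <- sum((ni - 1) * vi) / (N - k)

# Bartlett's K-squared: log pooled variance versus weighted
# log group variances, divided by a small-sample correction
K2 <- ((N - k) * log(sp2) - sum((ni - 1) * log(vi))) /
  (1 + (sum(1 / (ni - 1)) - 1 / (N - k)) / (3 * (k - 1)))

p_value <- pchisq(K2, df = k - 1, lower.tail = FALSE)
print(c(K2 = K2, p = p_value))

ref <- bartlett.test(len ~ factor(dose), data = ToothGrowth)
```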

We illustrate Bartlett's test using the ToothGrowth dataset where the 
tooth lengths are modeled as a function of the dose of vitamin C. 

> data(ToothGrowth) 

> bartlett.test(len ~ factor(dose), data=ToothGrowth) 

Bartlett test of homogeneity of variances 
data: len by factor(dose) 

Bartlett's K-squared = 0.6655, df = 2, p-value = 0.717 

Based on Bartlett's test, we fail to reject the null hypothesis of variance 
homogeneity for the three doses since the p-value is so large. 

Bartlett's test is sensitive to departures from normality as well as to
heteroscedasticity. Levene's test is an alternative test for equality of
variances which is more robust against departures from the normality
assumption. Levene's test computes the absolute differences of the
observations in each group from the group median and then tests if the
means of these deviations are equal for all groups.
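
This description translates directly into a one-way analysis of variance
on the absolute deviations, which can be done in base R. The sketch
below applies it to the ToothGrowth data grouped by dose; the variable
names are hypothetical.

```r
data(ToothGrowth)
g <- factor(ToothGrowth$dose)

# Absolute deviation of each observation from its group median
dev <- abs(ToothGrowth$len - ave(ToothGrowth$len, g, FUN = median))

# One-way ANOVA on the deviations gives the Levene F test
tab <- anova(lm(dev ~ g))
f_value <- tab[1, "F value"]
p_value <- tab[1, "Pr(>F)"]
print(tab)
```

Replacing median by mean in the call to ave gives the variant of the
test that is centered at the group means.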

The leveneTest function from the car package computes Levene's test for
homogeneity of variance across groups. leveneTest accepts the same
input as bartlett.test.

> library(car) 

> leveneTest(len ~ factor(dose), data=ToothGrowth) 

Levene's Test for Homogeneity of Variance (center = median) 

Df F value Pr(>F) 
group 2 0.6457 0.5281 

57 

Levene's test for homogeneity has a p-value of 0.5281 which is slightly 
smaller than the p-value from Bartlett's test but the conclusion is still 
the same: we fail to reject the hypothesis of variance homogeneity for 
the three doses. 

Instead of looking at absolute deviations from the group median we can
look at absolute deviations from the group mean. The center option
to leveneTest sets the function that is used to compute the absolute
deviations. For example, if we set center=mean then the mean will be
used instead of the median. Generally, the median provides good
robustness against many types of non-normal data while retaining good
statistical power.

If the data are not believed to be normally distributed, then the
non-parametric Fligner-Killeen test can be used to test the hypothesis that
the variances in each group are the same. fligner.test computes
Fligner-Killeen's test statistic and the function accepts the same type of
input as bartlett.test or leveneTest.

> fligner.test(len ~ factor(dose), data=ToothGrowth) 

Fligner-Killeen test of homogeneity of variances 
data: len by factor(dose) 

Fligner-Killeen:med chi-squared = 1.3879, df = 2, 
p-value = 0.4996 

We reach the same conclusion and fail to reject the hypothesis that the 
variances are equal. 

If there is more than one grouping factor, then all factors must be
completely crossed before using bartlett.test, leveneTest, or
fligner.test. The interaction function can be used to create a
complete cross of all factors. If the ToothGrowth data is to be analyzed
using a two-way analysis of variance, we should check for variance
homogeneity for the cross of dose and supplement type.

> bartlett.test(len ~ interaction(supp, dose), data=ToothGrowth) 

Bartlett test of homogeneity of variances 
data: len by interaction(supp, dose) 

Bartlett's K-squared = 6.9273, df = 5, p-value = 0.2261 

There is no indication of variance heterogeneity when the groups are
defined by both supplement type and dose.

See also: The oneway.test function can test equality of means for 
normally distributed data without assuming equal variances in each 
group. 


3.23 Validate a linear or generalized linear model 

Problem: You want to validate a linear or generalized linear model to
check if the underlying assumptions appear to be fulfilled.

Solution: Residuals commonly form the basis for graphical model
validation for linear and generalized linear models (see Rule 4.18).
However, the cumulative sum of residuals or other aggregates of residuals
(e.g., smoothed residuals or moving sums) can be used to provide
objective criteria for model validation.

The cumres function from the gof package implements model-checking
techniques based on cumulative residuals and calculates
Kolmogorov-Smirnov and Cramer von Mises test statistics based on linear and
generalized linear models (currently cumres supports linear, logistic
and Poisson regression models with canonical links).

cumres takes a model object fitted by lm or glm as first argument and
calculates the test statistic for the cumulative residuals against the
predicted values as well as against each of the quantitative explanatory
variables. The R argument defaults to 500 and sets the number of
samples used for the simulations, and the b option determines the
bandwidth used for moving sums of cumulative residuals. The default value
of b=0 corresponds to an infinitely wide bandwidth, i.e., standard
cumulated residuals.

In the cherry tree dataset trees we seek to model the volume of the 
trees as a linear function of the height and diameter. We use the cumres 
function to check the validity of the multiple regression model. 


> library(gof) 

> data(trees) 

> attach(trees) 

> model <- lm(Volume ~ Girth + Height) # Multiple regression 

> cumres(model) 

Kolmogorov-Smirnov-test: p-value=0.09 
Cramer von Mises-test: p-value=0.012 

Based on 500 realizations. Cumulated residuals ordered 
by predicted-variable. 

Kolmogorov-Smirnov-test: p-value=0.06 
Cramer von Mises-test: p-value=0.004 

Based on 500 realizations. Cumulated residuals ordered 
by Girth-variable. 

Kolmogorov-Smirnov-test: p-value=0.256 
Cramer von Mises-test: p-value=0.254 

Based on 500 realizations. Cumulated residuals ordered 
by Height-variable. 


The output shows quite low p-values for both the predicted and girth
variables which suggests that a better model should be found. An improved
model might be based on the formula for the volume of a geometric
cone shape, which suggests that we log transform all of the volume, girth
and height.

> lVol <- log(Volume) # Log transform variables 

> lGir <- log(Girth) 

> lHei <- log(Height) 

> model2 <- lm(lVol ~ lGir + lHei) # New model 

> cumres(model2) 

Kolmogorov-Smirnov-test: p-value=0.584 
Cramer von Mises-test: p-value=0.582 

Based on 500 realizations. Cumulated residuals ordered 
by predicted-variable. 

Kolmogorov-Smirnov-test: p-value=0.598 
Cramer von Mises-test: p-value=0.508 

Based on 500 realizations. Cumulated residuals ordered 
by lGir-variable. 

Kolmogorov-Smirnov-test: p-value=0.516 
Cramer von Mises-test: p-value=0.602 

Based on 500 realizations. Cumulated residuals ordered 
by lHei-variable. 


We see that the p-values for the cumulative residual goodness-of-fit 
statistics have improved substantially with this transformed multiple 
regression model. 
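
The cone argument can also be checked against the estimated
coefficients. A cone with base diameter d and height h has volume
V = (pi/12) d^2 h, so log V = constant + 2 log d + 1 log h, and the
slopes of the log-log regression should be near 2 and 1. The sketch
below refits the model with the data argument instead of attach; the
object names are hypothetical.

```r
data(trees)

# Log-log regression suggested by the cone volume formula
fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)
b <- coef(fit)
print(b)

# Confidence intervals for comparison with the values 2 and 1
print(confint(fit))
```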

The gof package also extends the generic plot function to produce
plots of the cumulative residuals. The plots show the cumulated
residuals, a sample of realizations from the asymptotic distribution (the
number of samples is determined by the plots option to cumres), and
the corresponding 95% confidence interval. The level option sets the
confidence level when plotting cumres objects and the default value
is 0.95. Note that the plot function creates one plot for the predicted
values as well as one for each of the quantitative explanatory variables.
The following code is used to plot the cumulative residuals and the
results for two of the three plots are shown in Figure 3.9.

> cr2 <- cumres(model2) # Save the cumulated residuals 

> par(mfrow=c(2,2)) # Allow 2x2 plots in the same Figure 

> plot(cr2) 

The observed cumulated residuals are well within the confidence band 
in both plots, so there is nothing here that suggests the transformed 
model is inadequate. 

See also: The cox.aalen function in the timereg package implements
a similar method for survival data.

Figure 3.9: The left panel shows the cumulated residuals (black line), 50
sample realizations (dark gray lines) and the corresponding 95% confidence
interval (light gray area) against the predicted values. The right panel
shows the cumulated residuals against logarithm of tree girth.


Contingency tables 


3.24 Analysis of two-dimensional contingency tables 

Problem: You wish to test for homogeneity or independence in a
two-dimensional contingency table.

Solution: A two-dimensional contingency table is a table of counts
that is formed by classifying observations by two categorical variables,
where one variable determines the row categories and the other
variable defines the column categories. The observations are assumed to
be independent. R uses the chisq.test function for chi-squared tests
and the fisher.test function for exact tests in two-dimensional
contingency tables. Both functions accept a contingency table as input.

As an example, we use the following data from the county of Århus in
Denmark where the concentration of the pesticide dichlorobenzamide
(also called BAM) in drinking water was examined in two different
municipalities within the county. The allowable limit of BAM is
0.10 μg/l, and the data are summarized below.

                    Concentration (in μg/l)
Municipality    < 0.01   0.01-0.10   > 0.10   Total
Hadsten             23          12        6      41
Hammel              20           5        9      34


We can test for independence between the two municipalities using 
both the chi-square and exact test with the following code. 

> m <- matrix(c(23, 20, 12, 5, 6, 9), ncol=3) # Input data 

> m

     [,1] [,2] [,3]
[1,]   23   12    6
[2,]   20    5    9

> chisq.test (m) # Chi-square test 

Pearson's Chi-squared test 


data: m 

X-squared = 3.065, df = 2, p-value = 0.216 

> fisher.test(m) # Fisher's exact test 

Fisher's Exact Test for Count Data 


data: m 

p-value = 0.2334 

alternative hypothesis: two.sided 

The chi-square and exact analyses show virtually the same results and
we find no evidence of a difference in the distribution of BAM
between the Hadsten and Hammel areas.
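
To see where the chi-squared statistic comes from, the following sketch
recomputes it by hand from the expected counts under independence. The
object names (expected, x2, df, p, res) are illustrative and not part of
the example above; only base R functions are used.

```r
# Observed BAM counts: rows are municipalities, columns are concentration bands
m <- matrix(c(23, 20, 12, 5, 6, 9), ncol = 3,
            dimnames = list(c("Hadsten", "Hammel"),
                            c("<0.01", "0.01-0.10", ">0.10")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(m), colSums(m)) / sum(m)

# Pearson chi-squared statistic, degrees of freedom and p-value
x2 <- sum((m - expected)^2 / expected)
df <- (nrow(m) - 1) * (ncol(m) - 1)
p  <- pchisq(x2, df = df, lower.tail = FALSE)

# The hand computation should agree with chisq.test
res <- chisq.test(m)
c(by.hand = x2, chisq.test = unname(res$statistic))
```

The expected counts are also available as res$expected, which is a
convenient place to check whether any cell has a small expected count.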

Additional options are relevant for 2x2 contingency tables. By default,
the chisq.test function applies a continuity correction for an
improved test statistic unless correct=FALSE is specified. For 2x2
contingency tables, we can also specify the alternative hypothesis
for the fisher.test function with the alternative option. The default
is two.sided and the other possibilities, greater and less, give
one-sided alternatives where the true odds ratio is either greater
than or less than 1.

If we collapse the table above such that concentrations below 0.10 μg/l
are grouped together we can analyze the resulting 2x2 contingency
table as follows:

> m2 <- matrix(c(35, 25, 6, 9), ncol=2) # Input data 

> chisq.test(m2) 

Pearson's Chi-squared test with Yates' continuity 
correction 

data: m2 

X-squared = 0.9718, df = 1, p-value = 0.3242 

> chisq.test(m2, correct=FALSE) # No continuity correction

Pearson's Chi-squared test 
data: m2 

X-squared = 1.6275, df = 1 , p-value = 0.2020 

> fisher.test(m2, alternative="greater") # One-sided alternative 

Fisher's Exact Test for Count Data 

data: m2 

p-value = 0.1622 

alternative hypothesis: true odds ratio is greater than 1 
95 percent confidence interval: 

0.6883204 Inf 

sample estimates: 
odds ratio 
2.078964 

The first two analyses show that Yates' continuity correction is used
automatically with 2x2 tables, and in this instance the conclusions from
the analyses with and without the continuity correction are the same. The
one-sided exact test does not reject homogeneity between the two areas
(p = 0.1622) when we only consider the one-sided alternative hypothesis.
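
As a minimal sketch of what the continuity correction does, the two
chi-squared statistics above can be recomputed by hand. The names
x2_plain, x2_yates and or_sample are illustrative. Note that the simple
cross-product odds ratio computed here is not the same estimator as the
conditional maximum likelihood estimate printed by fisher.test, so the
two numbers differ slightly.

```r
# Collapsed 2x2 table: columns are "at most 0.10" and "above 0.10"
m2 <- matrix(c(35, 25, 6, 9), ncol = 2)

# Expected counts under independence
expected <- outer(rowSums(m2), colSums(m2)) / sum(m2)

# Pearson statistic without correction
x2_plain <- sum((m2 - expected)^2 / expected)

# Yates' correction subtracts 0.5 from each absolute deviation
# (all deviations here exceed 0.5, so no truncation is needed)
x2_yates <- sum((abs(m2 - expected) - 0.5)^2 / expected)

# Sample (cross-product) odds ratio
or_sample <- (m2[1, 1] * m2[2, 2]) / (m2[1, 2] * m2[2, 1])

c(plain = x2_plain, yates = x2_yates, odds.ratio = or_sample)
```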

The chisq.test function prints a warning message if the χ²-approximation
of the test statistic may be incorrect. In those situations it may
be necessary to collapse categories or just generally use the exact test.


3.25 Analyze contingency tables using log-linear models

Problem: You wish to test for independence in a multi-dimensional 
contingency table. 

Solution: Contingency tables can be analyzed as a log-linear model
where the dependency between categorical variables is explained. All
k categorical variables used to define the contingency table are treated
as responses and the table shows their joint distribution. A multinomial
(or product multinomial) model is appropriate for describing the
joint distribution of the categorical variables when the total number of
observations (or the total number of observations for specific subsets)
is considered fixed. However, the appropriate multinomial model
happens to be equivalent to a Poisson generalized linear model with
the appropriate terms included as explanatory variables so we can use
Poisson regression with a log link (see Rule 3.10) to analyze the counts
in the table as realizations from a Poisson random variable.

Testing the hypothesis of complete independence in the multinomial
model is equivalent to comparing the additive Poisson model to the
fully saturated model (i.e., the model where all interactions between
the variables are included) since the additive Poisson model corresponds
to independence among all categorical variables.

The anova function can test specific hypotheses for the log-linear Poisson
model, and the option test="Chisq" should be set to obtain p-values
from the likelihood-ratio test statistic.
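
The correspondence between the additive Poisson model and independence
can be illustrated on a small two-way table. The sketch below reuses the
BAM counts from Rule 3.24; the data frame counts, its column names and
the objects fit_tab and G2 are assumptions made for illustration. It
checks that the fitted values from the additive model equal the usual
expected counts under independence, and that the residual deviance is
the likelihood-ratio statistic.

```r
# BAM counts from Rule 3.24 in long format (one row per cell)
counts <- data.frame(
  area = factor(rep(c("Hadsten", "Hammel"), times = 3)),
  conc = factor(rep(c("low", "mid", "high"), each = 2)),
  Freq = c(23, 20, 12, 5, 6, 9))

# Additive (independence) model and saturated model
indep <- glm(Freq ~ area + conc, family = poisson, data = counts)
full  <- glm(Freq ~ area * conc, family = poisson, data = counts)
anova(indep, full, test = "Chisq")

# Expected counts under independence, computed directly from the margins
m <- xtabs(Freq ~ area + conc, data = counts)
expected <- outer(rowSums(m), colSums(m)) / sum(m)

# Fitted values from the additive Poisson model, arranged as a table
counts$fit <- fitted(indep)
fit_tab <- xtabs(fit ~ area + conc, data = counts)

# Likelihood-ratio statistic computed by hand
G2 <- 2 * sum(m * log(m / expected))
c(deviance = deviance(indep), G2 = G2)
```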

We use the survey data frame from the MASS package to investigate
three of the categorical traits reported by 237 students. The xtabs and
as.data.frame functions are used to transform the responses to a
data frame that contains the frequency counts. In this example we consider
three categorical variables: the gender of the student, which hand
is on top when hands are clapped (Clap with levels "Right", "Left",
and "Neither"), and how frequently the student exercises (Exer with
levels "Freq", "Some", and "None").

> library(MASS) 

> data(survey) 

> mydata <- as.data.frame(xtabs(~ Sex + Clap + Exer, 

+ data=survey)) 


> head(mydata)

     Sex    Clap Exer Freq
1 Female    Left Freq   11
2   Male    Left Freq    8
3 Female Neither Freq   17
4   Male Neither Freq   16
5 Female   Right Freq   21
6   Male   Right Freq   41


> full <- glm(Freq ~ Sex*Clap*Exer, family=poisson, data=mydata) 

> indep <- glm(Freq ~ Sex+Clap+Exer, family=poisson, data=mydata) 

> anova(indep, full, test="Chisq") 

Analysis of Deviance Table 


Model 1: Freq ~ Sex + Clap + Exer 
Model 2: Freq ~ Sex * Clap * Exer 

  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        12     26.676
2         0      0.000 12   26.676    0.0086 **

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the hypothesis of complete independence in the contingency
table since the likelihood-ratio test gives a p-value of 0.0086.

Formulating the contingency table as a log-linear model enables us to test
for block and conditional independence among the categorical variables.
Block independence refers to the situation where the distribution
of the categorical variables factors into disjoint blocks such that
variables in one block are independent of the variables in other blocks.
There are several possible hypotheses corresponding to different sets of
variables, and here we will just look at one of them. The hypothesis of
block independence is tested by comparing the fit of the blocked model
to the full, saturated model.

> block1 <- glm(Freq ~ Sex + Clap*Exer, family=poisson,

+ data=mydata)

> anova(block1, full, test="Chisq")

Analysis of Deviance Table 

Model 1: Freq ~ Sex + Clap * Exer 
Model 2: Freq ~ Sex * Clap * Exer 

Resid. Df Resid. Dev Df Deviance P(>|Chi|) 

1 8 13.714 

2 0 0.000 8 13.714 0.08954 . 

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We fail to reject the hypothesis that the gender of the student is independent
of hand clapping and exercise (p = 0.08954).

Conditional independence refers to the situation where two (or more)
categorical variables are independent of each other conditional on a
third variable. For example, if gender and exercise are associated and
hand clapping and exercise are associated then that corresponds to the
log-linear model where there are interactions between gender and exercise,
as well as clapping and exercise as shown in the following. Gender
and clapping are both dependent on exercise but there is no interaction
between gender and clapping so there is no direct effect between those
two variables.

> cond1 <- glm(Freq ~ Sex*Exer+Clap*Exer, family=poisson,

+ data=mydata)

> anova(cond1, full, test="Chisq")

Analysis of Deviance Table 

Model 1: Freq ~ Sex * Exer + Clap * Exer 
Model 2: Freq ~ Sex * Clap * Exer 

Resid. Df Resid. Dev Df Deviance P(>|Chi|) 

1 6 7.5523 

2 0 0.0000 6 7.5523 0.2728 

We test the conditional independence model against the fully saturated 
model and fail to reject the hypothesis that gender and hand clapping 
are conditionally independent given exercise. 
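
The p-values in the three analysis-of-deviance tables can be checked
directly from the reported deviances and degrees of freedom, since each
test compares a reduced model with the saturated model. This is a small
sketch using the numbers printed above; the vector names are illustrative.

```r
# Deviances and degrees of freedom as printed in the three anova tables
dev <- c(independence = 26.676, block = 13.714, conditional = 7.5523)
df  <- c(independence = 12,     block = 8,      conditional = 6)

# Upper tail of the chi-squared distribution gives the p-values
p <- pchisq(dev, df = df, lower.tail = FALSE)
round(p, 4)
```

The degrees of freedom equal the number of interaction parameters that
are removed from the saturated model in each case.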

See also: Rule 2.18 shows how to convert a contingency table to a data
frame. Alternatively, contingency tables can often be analyzed as logistic
regression (see Rule 3.8), or multinomial logistic regression (see
Rule 3.9) depending on the statistical design.


Agreement 


3.26 Create a Bland-Altman plot of agreement to compare
two quantitative methods

Problem: You want to create a Bland-Altman plot to evaluate the
agreement between quantitative measurements from two different methods
or between two raters.

Solution: The Bland-Altman plot (also called Tukey's mean difference
plot) can be used to compare how well two different quantitative measurement
techniques agree. The Bland-Altman plot can also be used to
compare a new measurement method with a gold standard or reference
value.

In the Bland-Altman plot, the differences between the two methods are
plotted against their average value and from the resulting scatter plot
we can determine the bias (i.e., the average difference), the limits of
agreement, as well as if the underlying assumption of constant variability
is fulfilled (if not, we may see an increase in the scatter of the
differences, as the magnitude of the measurements increases).

The Bland-Altman plot is easily created with the following few lines,
where hplc are measurements of muconic acid in human urine obtained
from High Performance Liquid Chromatography (HPLC) from
11 samples and gcms are the corresponding measurements on the same
11 samples obtained by Gas Chromatography-Mass Spectrometry (GC-MS).
We wish to compute 95% limits of agreement for the two methods.

> hplc <- c(139, 120, 143, 496, 149, 52, 184, 190, 32, 312, 19)

> gcms <- c(151, 93, 145, 443, 153, 58, 239, 256, 69, 321, 8)

> average <- (hplc + gcms)/2 # Compute average

> dif <- (hplc - gcms) # and difference

> plot(average, dif, ylim=c(-80,80), # Plot diff vs average 

+ xlab="Average", ylab="Difference") 

> # Calculate limit for 95% agreement

> limit <- qnorm(.975)

> # Add average bias and 95% limits of agreement

> abline(h=mean(dif)+c(-limit,0,limit)*sd(dif), lty=c(3,1,3))

> # Add line showing bias as function of magnitude

> abline(lm(dif ~ average), lty=2)


Figure 3.10: Bland-Altman plot of agreement for two methods. The solid line
is average bias, dotted lines are 95% limits of agreement, and the dashed line
shows bias as function of magnitude.


Figure 3.10 shows the resulting Bland-Altman plot. The limits of agreement
tell us how far apart measurements by the two methods are likely
to be for most items and we can use these limits to determine if the
difference between methods is acceptable. The regression line of difference
on average (shown in Figure 3.10 as the dashed line) can be used to
check the assumption that the bias is constant with magnitude. In the
present case there is a slightly positive slope which indicates that the
bias may increase with magnitude, but it is not significant. If we wish
to look at other limits of agreement than 95%, then we should change
the corresponding quantile in the call to qnorm above.
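
The quantities drawn in the plot can also be extracted as numbers. The
following sketch computes the bias, the 95% limits of agreement and the
p-value for the slope of the regression of difference on average. The
object names bias, loa, trend and slope_p are illustrative.

```r
hplc <- c(139, 120, 143, 496, 149, 52, 184, 190, 32, 312, 19)
gcms <- c(151, 93, 145, 443, 153, 58, 239, 256, 69, 321, 8)

average <- (hplc + gcms) / 2
dif     <- hplc - gcms

# Average bias and 95% limits of agreement
bias <- mean(dif)
loa  <- bias + c(lower = -1, upper = 1) * qnorm(0.975) * sd(dif)

# Does the bias change with the magnitude of the measurement?
trend   <- lm(dif ~ average)
slope   <- unname(coef(trend)["average"])
slope_p <- summary(trend)$coefficients["average", "Pr(>|t|)"]

round(c(bias = bias, loa, slope = slope, p.slope = slope_p), 3)
```

For other limits of agreement, say 90%, qnorm(0.975) would be replaced
by qnorm(0.95).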

See also: Rule 3.27 considers agreement of a quantitative measurement
among more than two raters and Rule 3.28 computes an index that measures
interrater agreement for categorical items. The BlandAltman
function from the MethComp package can also create Bland-Altman
plots.




3.27 Determine agreement among several methods of a 
quantitative measurement 

Problem: You want to evaluate the agreement among several different
methods/raters for a quantitative measurement even in the presence of
repeated measurements on each item.

Solution: The standard Bland-Altman plot only considers two different
measurement methods. If we wish to determine the agreement
among several methods/raters or we have repeated measurements of
each item (for each method/rater), then we can use a linear mixed
model to estimate both the repeatability and the agreement among methods.

The MethComp package provides methods for comparison of quantitative
measurement methods/raters for the situation where specific
methods/raters are of interest. The functions in the MethComp package
are wrappers that set up the correct mixed-effects models, fit them and
present the results in a nice compact manner. The MethComp package
fits a linear mixed model with item and method/rater as fixed effects
and interactions between method and item (and possibly also between
replicate and item if replicates are linked across methods) as random
effects. Furthermore, the model allows the different methods to have
individual variances corresponding to different precision of the methods.

The Meth function sets up a data frame in the format required by the 
functions in the MethComp package. Specifically, the data frame should 
contain columns entitled y, meth, item, and repl, corresponding to a 
vector of numeric measurements, a factor of method/raters, a factor of 
items, and a factor to identify replicates as input. Alternatively, if a data 
frame is added as the first argument to Meth then we can just specify 
the column numbers that constitute the relevant variables in the data 
frame. 

The BA.est function estimates and extracts the relevant variance components
and computes the limits of agreement between all methods for the situation
where the bias between methods is assumed to be constant.
BA.est expects a data frame of the form produced by the Meth function
as input and the result from BA.est can be plotted to obtain all
pairwise Bland-Altman plots. The Transform argument can be set to a
function used to transform the response prior to analysis. If the desired
transformation is given in the call to BA.est then the plotting functions
in the MethComp package automatically back-transform the results to
the original scale.

The option linked for BA.est defaults to TRUE, which means that
the replicates within each item are measured in parallel across methods
(i.e., the random interaction between item and replicate is included in
the model). If linked=FALSE then the replicates are assumed to be
exchangeable within items. The alpha option sets the desired significance
level and defaults to 0.05 so 95% prediction limits are computed.

The rainman data frame from the kulife package contains guesses
from 5 raters of the number of points in 30 pictures (10 items which
were, unbeknownst to the raters, each replicated three times). We
wish to determine the agreement among raters and the repeatability
since we have replicates of each item. The logarithm of the estimated
number of points is used in the analysis since it is believed that the
variance increases with increasing number of points in the picture. For
brevity we will just compare three of the raters. The resulting plot is
shown in Figure 3.11.

> library(MethComp) 

> library(kulife) 

> data(rainman) 

> head(rainman)

  SAND  ME  TM  AJ  BM  LO
1  120 175 120 105 100 100
2   48  50  50  45  50  70
3   88 150  75  75  60  80
4   32  45  22  28  30  30
5   24  25  22  25  20  20
6  100 125  80  91  80  70


> # Use column 1 to define items, and values from columns 3 to 5

> # Meth automatically uses the original column names for the

> # methods and creates the replicate vector

> mydata <- Meth(item=1, y=3:5, data=rainman)

> result <- BA.est(mydata, Transform="log") # Use log transform

> result 


 Note: Response transformed by: .Primitive("log")

 Conversion between methods:

To: From:   alpha   beta     sd  LoA: lower  upper
AJ  AJ      0.000  1.000  0.188      -0.376  0.376
    BM      0.005  1.000  0.231      -0.456  0.466
    TM     -0.090  1.000  0.210      -0.511  0.330
BM  AJ     -0.005  1.000  0.231      -0.466  0.456
    BM      0.000  1.000  0.241      -0.481  0.481
    TM     -0.095  1.000  0.221      -0.537  0.347
TM  AJ      0.090  1.000  0.210      -0.330  0.511
    BM      0.095  1.000  0.221      -0.347  0.537
    TM      0.000  1.000  0.199      -0.399  0.399

 Variance components (sd):
    IxR   MxI   res
AJ 0.05 0.081 0.133
BM 0.05 0.000 0.170
TM 0.05 0.000 0.141

The output gives the limits of agreement between all pairs of methods.
The agreement between methods is considered acceptable if the
variability between observations made with different methods on the
same subject is not much larger than the variability between observations
with the same method on this subject.

The bias between methods/raters is listed in the alpha column of the
output and limits of agreement on the transformed scale are shown in
the last two columns. The output lines where "To" and "From" correspond
to the same method/rater are the estimated repeatability prediction
intervals. Thus, rater "BM" has an average bias of -0.095 relative to
rater "TM" (i.e., BM scores exp(-0.095) = 0.91 times "TM" so about 9%
lower), and the limits of agreement for the ratio of "TM" to "BM"
are [exp(-0.347); exp(0.537)] = [0.707; 1.711]. Rater "TM" has a
repeatability prediction interval for the ratio that goes from
exp(-0.399) = 0.67 to exp(0.399) = 1.49.
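
Because the analysis is carried out on the log scale, differences
back-transform to ratios. The arithmetic can be sketched as follows,
with the numbers copied from the rows of the BA.est output for the
raters "BM" and "TM"; the object names are illustrative.

```r
# Row "To: BM, From: TM": bias of BM relative to TM on the log scale
alpha_bm_tm <- -0.095
ratio_bias  <- exp(alpha_bm_tm)        # BM as a multiple of TM

# Row "To: TM, From: BM": limits of agreement on the log scale
loa_tm_bm   <- c(lower = -0.347, upper = 0.537)
ratio_loa   <- exp(loa_tm_bm)          # limits for the ratio TM/BM

# Row "To: TM, From: TM": repeatability prediction interval for TM
repeat_tm   <- exp(c(lower = -0.399, upper = 0.399))

round(ratio_bias, 2)
round(ratio_loa, 3)
round(repeat_tm, 2)
```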

The estimated variance components used for the calculations are shown 
at the end of the output. Note they are on the transformed scale. 

If the plot function is used on the data frame created by the Meth function,
then pairwise Bland-Altman plots (on the original scale) between
all methods are created. To create a plot of the conversion between
methods based on the BA.est result we plot the BA.est output object
with argument pi.type="BA". The argument points=TRUE should
be set to plot the individual points, and the wh.cmp argument should
be a vector of length 2 that determines which two methods are compared.
The output from the following two lines is seen in Figure 3.11.

> plot(mydata) 

> plot(result, wh.cmp=c(1,2), points=TRUE, pi.type="BA") 

In the right graph of Figure 3.11 the limits of agreement are automatically
transformed back to the original scale.

Figure 3.11: The left graph shows Bland-Altman agreement plots for three
methods/raters. The upper panels show Bland-Altman plots with limits of
agreement. The lower panels show pairwise scatter plots (with identity line
included). The right graph shows the limits of agreements between two of the
raters.


BA.est assumes that the raters are fixed in the analysis. However, in
some situations the raters are a random sample of raters from a larger
group and it may not be of interest to describe the bias between the
particular raters in the sample. Instead we want to consider raters in
general. If we assume that the variances for each rater are the same,
then the proper mixed effects model is easily fitted using lmer. We
include an interaction between method/rater and item and an interaction
between replicate and item (if the replicates are not believed to be
exchangeable). The limits of agreement for the difference between two
random raters are then calculated from twice the sum of the variance
components for the method, residual, and interaction between method
and item random effects. This is done below, where the output from
lmer is slightly abbreviated.


> library(lme4) 

> mi <- interaction(mydata$meth, mydata$item) 

> ri <- interaction(mydata$repl, mydata$item) 

> lmer(log(y) ~ item + (1|meth) + (1|mi) + (1|ri), data=mydata)
Linear mixed model fit by REML 

Formula: log(y) ~ item + (1 | meth) + (1 | mi) + (1 | ri) 

Data: mydata 

AIC BIC logLik deviance REMLdev 
-14.13 20.87 21.06 -80.8 -42.13 

Random effects: 

Groups Name Variance Std.Dev. 

ri (Intercept) 0.0012660 0.035581 

mi (Intercept) 0.0000000 0.000000 

meth (Intercept) 0.0020532 0.045312 

Residual 0.0245641 0.156729 

Number of obs: 90, groups: ri, 30; mi, 30; meth, 3 

> c(-1,1)*1.96*sqrt(2*(0.0000 + 0.0020532 + 0.0245641)) 

[1] -0.4522234 0.4522234 

Thus the average limits of agreement for two random raters (assuming
that all raters have the same variance and that replicates are not
exchangeable) are [-0.452; 0.452]. This is on the log scale so the limits of
agreement for the ratio between two random raters are [0.636; 1.571].

See also: Rule 3.13 introduces linear mixed effect models and Rule 3.12
introduces lmer.
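
The calculation of the limits of agreement for two random raters can be
wrapped in a small function. This is only a sketch: loa_random_raters is
a hypothetical helper and not part of lme4 or MethComp, and the variance
components are typed in from the lmer output shown in this rule. It uses
the exact normal quantile rather than 1.96, so the last digit may differ
slightly from the numbers in the text.

```r
# Hypothetical helper: limits of agreement for the difference between
# two randomly chosen raters, given variance components on the log scale
loa_random_raters <- function(var_meth, var_mi, var_res, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  sd_diff <- sqrt(2 * (var_meth + var_mi + var_res))
  c(lower = -1, upper = 1) * z * sd_diff
}

# Variance components copied from the lmer output
log_limits <- loa_random_raters(var_meth = 0.0020532,
                                var_mi   = 0,
                                var_res  = 0.0245641)

# Back-transform to limits for the ratio between two raters
ratio_limits <- exp(log_limits)

round(log_limits, 3)
round(ratio_limits, 3)
```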


3.28 Calculate Cohen's kappa 

Problem: You wish to calculate Cohen's kappa as a measure of interrater
agreement between two raters/methods on categorical (or ordinal)
data.

Solution: Cohen's kappa, κ, is a measure of inter-rater agreement (or
concordance) between two raters/methods who classify each of n observations
into k categories. κ = 1 when there is perfect agreement and
κ = 0 when there is no agreement other than what would be expected
by chance.
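
The definition can be made concrete with a small base R sketch of the
unweighted coefficient: the observed proportion of agreement is compared
with the proportion expected if the two raters classified independently.
The function cohen_kappa and the toy ratings a and b are made up for
illustration and are not part of the irr package.

```r
# Hypothetical helper: unweighted Cohen's kappa for two rating vectors
cohen_kappa <- function(r1, r2) {
  lev <- sort(unique(c(r1, r2)))
  tab <- table(factor(r1, levels = lev), factor(r2, levels = lev))
  p   <- tab / sum(tab)                      # cell proportions
  p_obs <- sum(diag(p))                      # observed agreement
  p_exp <- sum(rowSums(p) * colSums(p))      # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Toy ratings of ten items by two raters (illustrative data)
a <- c("yes", "yes", "no", "no", "yes", "no", "yes", "no",  "no", "yes")
b <- c("yes", "no",  "no", "no", "yes", "no", "yes", "yes", "no", "yes")

cohen_kappa(a, b)
```

Here the raters agree on 8 of 10 items and chance agreement is 0.5, so
kappa is (0.8 - 0.5)/(1 - 0.5) = 0.6.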

The irr package provides the kappa2 function, which calculates Cohen's
kappa for two raters. The input to kappa2 should be a matrix or
data frame with n rows (one for each item) and two columns representing
the ratings or categories of the two raters. The two columns of
ratings are internally converted to factors when the kappa measure is
calculated and the categories are therefore ordered according to factor
levels.

The anxiety dataset from the irr package contains anxiety ratings 
from 20 subjects, rated by 3 raters. Values range from 1 (not anxious at 
all) to 6 (extremely anxious). Initially, we just consider raters one and 
two. 


> library(irr) 

> data(anxiety) 

> head(anxiety)

  rater1 rater2 rater3
1      3      3      2
2      3      6      1
3      3      4      4
4      4      6      4
5      5      2      3
6      5      4      2

> kappa2(anxiety[, 1:2]) # Compute kappa for raters 1,2

 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 20
   Raters = 2
    Kappa = 0.119

The estimated kappa value is 0.119, which suggests poor agreement between
raters one and two, and the test of the hypothesis that kappa equals
zero has a test statistic of 1.16 with a corresponding p-value of 0.245.

Cohen's kappa treats classifications as nominal but when the categories
are ordered the seriousness of a disagreement depends on the difference
between ratings. The kappa2 function uses an unweighted coefficient
(i.e., all disagreements between raters are scored the same way
regardless of the categories selected) by default, which is meaningful
for nominal categories. For ordered categories, a weighted kappa measure
should be used and kappa2 provides two possibilities through
the weight option: weight="equal" (when weights depend linearly
on the distance between categories) and weight="squared" (when the
distance between categories is squared to compute the weights). In the
anxiety dataset the categories are ordered, and we could use, for example,
squared weights:

> kappa2(anxiety[,1:2], weight="squared") 

Cohen's Kappa for 2 Raters (Weights: squared) 

Subjects = 20 
Raters = 2 
Kappa = 0.297 

z = 1.34 
p-value = 0.18 

The squared weights suggest fair agreement between the two raters although
it is not significantly different from no agreement (p = 0.18).

Cohen's kappa only handles two raters. When there are more than two
raters, we can use Fleiss' kappa to estimate the agreement among raters.
It is assumed for Fleiss' kappa that each item has been rated by a fixed
number of raters but it is not necessary for all raters to rate all items
(e.g., item 1 could be rated by raters 1, 2, and 3 while item 2 could be
rated by raters 2, 4, and 9, etc.). Fleiss' kappa is a generalization of the
unweighted kappa, and does not take any weighting of disagreements
into account. Perfect agreement results in a value of 1 while no agreement
yields a Fleiss' kappa value < 0. The kappam.fleiss function
from the irr package calculates Fleiss' kappa and takes the same type
of input as kappa2: a matrix or data frame with n rows (one for each
item) and m columns representing the available ratings for each item.
Setting the option exact=TRUE calculates a version of Fleiss' kappa
that is identical to Cohen's kappa in the case of two raters.

> kappam.fleiss(anxiety) 
 Fleiss' Kappa for m Raters

Subjects = 20 
Raters = 3 
Kappa = -0.0411 

z = -0.634 
p-value = 0.526 

Based on Fleiss' kappa we get a value of -0.0411 which is less than zero
so we conclude that there is no agreement (other than what would be
expected by chance) among the three raters in the anxiety dataset.

See also: The kappam.light function from the irr package computes
Light's kappa as a measure of interrater agreement. If data are available
in the form of a contingency table, then the classAgreement function
from the e1071 package can calculate Cohen's kappa.


Multivariate methods 


3.29 Fit a multivariate regression model 

Problem: You wish to fit a multivariate linear regression model to two 
or more dependent quantitative response variables. 

Solution: Multivariate regression is an extension of linear normal
models to the situation where there are two or more dependent response
variables and where each response variable can be modeled by
a linear normal model. Multivariate analysis-of-variance refers to the
special case where the explanatory variable(s) are categorical. Multivariate
analysis of variance enables us to test hypotheses regarding the
effect of the explanatory variable(s) on two or more dependent variables
simultaneously while maintaining the desired magnitude of type
I error.

In R, the lm function can be used for multivariate regression modeling 
simply by letting the response variable be a matrix of responses, where 
each column corresponds to a response variable. 

In the following example we wish to model how the ozone, temperature,
and the solar radiation are influenced by the day number for
the airquality dataset. We model the set of response variables as
a quadratic function of the day number.

> data(airquality)

> attach(airquality) 

> day <- as.Date(paste("1973", Month, Day, sep="-")) 

> dayno <- julian(day, origin = as.Date("1973-01-01")) 

> head(dayno) 

[1] 120 121 122 123 124 125 

> model <- lm(cbind(Ozone, Temp, Solar.R) ~ dayno + I(dayno**2)) 

> model

Call:
lm(formula = cbind(Ozone, Temp, Solar.R) ~ dayno + I(dayno^2))

Coefficients:
             Ozone       Temp        Solar.R
(Intercept)  -2.662e+02  -5.456e+01   3.383e+00
dayno         3.205e+00   1.334e+00   2.154e+00
I(dayno^2)   -7.903e-03  -3.196e-03  -5.909e-03


If we use summary on the result from the lm call we will see estimates, 
standard errors and corresponding p-values for the individual analyses 
from each of the response variables. That output is not different from 
making separate calls to lm for ozone, temperature and solar radiation. 
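
The claim that the multivariate fit reproduces the separate fits can be
checked with a short sketch. To keep it self-contained it avoids attach
and passes a data frame through the data argument; the names aq, multi
and single are illustrative. Rows with missing values are removed first
so that both fits use the same observations, which mirrors what lm does
for a matrix response.

```r
# Complete cases only, so both models use the same 111 observations
aq  <- na.omit(airquality)
day <- as.Date(paste("1973", aq$Month, aq$Day, sep = "-"))
aq$dayno <- as.numeric(julian(day, origin = as.Date("1973-01-01")))

# Multivariate fit: the response is a matrix with one column per variable
multi  <- lm(cbind(Ozone, Temp, Solar.R) ~ dayno + I(dayno^2), data = aq)

# Separate univariate fit for one of the responses
single <- lm(Ozone ~ dayno + I(dayno^2), data = aq)

# The Ozone column of the coefficient matrix equals the univariate fit
cbind(multivariate = coef(multi)[, "Ozone"], univariate = coef(single))
```

The difference between the two approaches only appears in the tests:
anova on the multivariate object tests all responses simultaneously.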

The test of the overall significance of each variable is computed when
the anova function is used on the multivariate lm object. There exist
several test statistics for multivariate regression models and by default,
R computes Pillai's trace (also called Pillai-Bartlett's trace). Other
test statistics like Wilks' lambda, Hotelling-Lawley's trace and Roy's
maximum root are chosen if the test argument for anova is set to
"Wilks", "Hotelling-Lawley", or "Roy", respectively.


> anova(model) 

Analysis of Variance Table 

             Df  Pillai approx F num Df den Df    Pr(>F)
(Intercept)   1 0.99485   6829.3      3    106 < 2.2e-16 ***
dayno         1 0.29319     14.7      3    106 4.718e-08 ***
I(dayno^2)    1 0.44344     28.2      3    106 1.812e-13 ***
Residuals   108

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


If we look at the individual analyses, it turns out that day number is
not significant when we look at the model with solar radiation as response.
However, Pillai's simultaneous test for all three response variables
shows that both day number and the quadratic polynomial of day
number are highly significant with p-values very close to zero.


3.30 Cluster observations

Problem: You have a dataset and want to identify clusters or groups 
of observations that are similar in some sense. 

Solution: Cluster analysis is concerned with assignment of a set of 
observations into groups or clusters so that observations in the same 
cluster resemble each other in some sense. 

There exist numerous packages and functions for cluster analysis in
R. Here we will just focus on two: kmeans for unsupervised k-means
clustering and hclust for hierarchical clustering.

The first argument to kmeans should be a numeric matrix of data or
an object that can be coerced to a numeric matrix. The rows represent
the observations to be clustered and the columns should contain
the variables that are used to determine if the observations are similar.
In k-means clustering the investigator specifies the desired number of
clusters and that is done by the centers argument in R. centers
should either be an integer that determines the number of clusters or a
matrix that sets the initial cluster centers. If only a number is given, then
the initial centers are sampled at random from the dataset, and slightly
different results may be obtained depending on the initial centers. The
algorithm argument sets the algorithm used to partition points into
the clusters (default is Hartigan-Wong and alternative methods comprise
Lloyd, Forgy, and MacQueen).

The agriculture data frame from the cluster package contains information
on the Gross National Product (GNP) per capita (x) and percentage
of the population working in agriculture (y) for each country
belonging to the European Union in 1993.

> library(cluster) 

> data(agriculture) 

> fit2 <- kmeans(agriculture, centers=2) # 2 cluster solution 

> fit2 

K-means clustering with 2 clusters of sizes 4, 8 

Cluster means: 

       x       y
1  9.000 16.1250
2 17.825  4.5625

Clustering vector:

  B  DK   D  GR   E   F IRL   I   L  NL   P  UK
  2   2   2   1   1   2   1   2   2   2   1   2

Within cluster sum of squares by cluster: 

[1] 90.76750 71.91375 

Available components:

[1] "cluster" "centers" "withinss" "size" 

The output from kmeans gives information on the cluster sizes, their 
respective means, the resulting allocation of each row and the within 
cluster sum of squares. 

We can calculate the total within cluster sum of squares for a different 
number of clusters with the following command if we want to use that 
to decide on how many clusters to use. 

> # Calculate the total within sum of squares for different 

> # number of clusters (here 2-6) 

> sapply(2:6, function(i) {

+ sum(kmeans(agriculture, centers=i)$withinss)})

[1] 162.68125 131.41417 58.90167 36.20000 28.86500 

If the variables are measured on very different scales, then the vari¬ 
ables in the data matrix may need to be standardized before clustering 
is undertaken. The scale function standardizes the columns in a data 
matrix by subtracting the mean and dividing by the standard deviation. 

> scale.agriculture <- scale(agriculture) 

> fit2s <- kmeans(scale.agriculture, centers=2) 

> fit2s 

K-means clustering with 2 clusters of sizes 4, 8 

Cluster means: 

x y 

1 -1.1869916 1.1963863 

2 0.5934958 -0.5981932 

Clustering vector: 

  B  DK   D  GR   E   F IRL   I   L  NL   P  UK
  2   2   2   1   1   2   1   2   2   2   1   2

Within cluster sum of squares by cluster: 

[1] 2.525810 2.432453 

Available components: 

[1] "cluster" "centers" "withinss" "size" 

> sapply(2:6, function(i) { 

+ sum(kmeans(scale.agriculture, centers=i)$withinss)}) 

[1] 4.958264 2.904270 1.662435 1.059511 1.190618 
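
The standardization done by scale can be checked by hand. The sketch below uses a small made-up matrix (not the agriculture data) and compares scale with an explicit "subtract the column mean, divide by the column standard deviation" computation.

```r
# Small made-up matrix with two columns on very different scales
m <- cbind(x=c(1, 2, 3, 4), y=c(10, 20, 30, 50))

z <- scale(m)               # center and scale each column

# The same computation written out explicitly
manual <- sweep(sweep(m, 2, colMeans(m), "-"), 2, apply(m, 2, sd), "/")

all.equal(z, manual, check.attributes=FALSE)   # same values
colMeans(z)                 # column means are (numerically) zero
apply(z, 2, sd)             # column standard deviations are one
```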

Hierarchical clustering requires a distance matrix to determine the
similarity between observations. The dist function computes the pairwise
distances between the rows of a data matrix. By default, dist uses
Euclidean distance but other metrics can be set with the method argument
(for example method="maximum" or method="manhattan").
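
As a small illustration of how the method argument changes the result, the sketch below computes the distance between two made-up points, (0, 0) and (3, 4), under three metrics.

```r
# Two made-up points in the plane
pts <- rbind(a=c(0, 0), b=c(3, 4))

d_euclid <- dist(pts)                       # sqrt(3^2 + 4^2) = 5
d_manhat <- dist(pts, method="manhattan")   # |3| + |4| = 7
d_max    <- dist(pts, method="maximum")     # max(|3|, |4|) = 4

c(euclidean=as.numeric(d_euclid),
  manhattan=as.numeric(d_manhat),
  maximum=as.numeric(d_max))
```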

When the pairwise distances are computed, the resulting matrix can
be used as argument to the hclust function to undertake hierarchical
clustering. The hclust function also has a method argument that determines
how the distance between two sub-clusters is measured. Some of the
possible grouping methods are complete (the default), average, and
single. The help page provides some additional information on these
and other clustering methods.

We can extract classification from a hierarchical cluster analysis by cut¬ 
ting the tree at a certain level using the function cutree. cutree 
requires a tree as first argument and either the option k which speci¬ 
fies the desired number of clusters or h which determines the height at 
which to cut the tree. The resulting clusters are printed by cutree. 
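
The difference between the k and h options can be seen on a tiny made-up dataset with two well-separated groups. This is only a sketch; the data values and the cut height are chosen for illustration.

```r
# Six made-up observations: three near 0 and three near 10
toy <- matrix(c(0, 0.1, 0.2, 10, 10.1, 10.2), ncol=1,
              dimnames=list(letters[1:6], "value"))

tree <- hclust(dist(toy))        # complete linkage (the default)

groups_k <- cutree(tree, k=2)    # ask for exactly two clusters
groups_h <- cutree(tree, h=1)    # cut the tree at height 1

groups_k
groups_h                         # same grouping, since the two groups
                                 # only merge far above height 1
```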

Figure 3.12 shows the output dendrograms from hclust obtained by 
using plot on the hierarchical cluster analysis object. 


> distances <- dist(agriculture) 

> hierclust <- hclust(distances) 

> hierclust 


Call: 

hclust(d = distances) 


Cluster method : complete 

Distance : euclidean 

Number of objects: 12 

> hierclust.single <- hclust(distances, method="single") 


> plot(hierclust) 

> plot(hierclust.single) 

> cutree(hierclust, k=2)            # Cut tree to get 2 groups
  B  DK   D  GR   E   F IRL   I   L  NL   P  UK
  1   1   1   2   2   1   2   1   1   1   2   1

> cutree(hierclust.single, h=5)     # Cut at height 5
  B  DK   D  GR   E   F IRL   I   L  NL   P  UK
  1   1   1   2   3   1   3   1   1   1   3   1


Here the same two clusters are found as from k-means clustering.

See also: The cluster package includes several clustering methods
that are more advanced than k-means and hierarchical clustering as
well as additional methods for plotting cluster results. The mclust
package provides functions for model based clustering. See Rule 4.15
on how to draw heat maps.

Figure 3.12: The left panel shows the hierarchical clustering dendrogram given
by the hclust function. The right graph shows the same dendrogram but
computed using the option method="single".


3.31 Use principal component analysis to reduce data di¬ 
mensionality 

Problem: You want to use principal component analysis to reduce data 
dimensionality. 

Solution: Principal component analysis is a useful technique when 
you have obtained measurements on a number of (possibly correlated) 
observed variables and you wish to reduce the number of observed 
variables to a smaller number of artificial variables (called principal 
components) that account for most of the variance in the observed vari¬ 
ables. The principal components may then be used as predictors in 
subsequent analyses. 

The prcomp function performs principal component analysis on a given
data matrix. By default, R centers the variables in the data matrix, but
does not scale them to have unit variance. In order to scale the variables
we should set the logical option scale.=TRUE.

In the following code we will make a principal component analysis on
the nine explanatory variables found in the biopsy dataset from the
MASS package. There are missing data present in the data frame so we
restrict our attention to the complete cases.

> library(MASS) 

> data(biopsy) 

> names(biopsy) 
[1] "ID" "V1" "V2" "V3" "V4" "V5" "V6"

[8] "V7" "V8" "V9" "class"

> predictors <- biopsy[complete.cases(biopsy),2:10] 

> fit <- prcomp(predictors, scale=TRUE) 

> summary(fit)

Importance of components:
                         PC1    PC2    PC3    PC4    PC5    PC6
Standard deviation     2.429 0.8809 0.7343 0.6780 0.6167 0.5494
Proportion of Variance 0.655 0.0862 0.0599 0.0511 0.0423 0.0335
Cumulative Proportion  0.655 0.7417 0.8016 0.8527 0.8950 0.9285
                          PC7    PC8     PC9
Standard deviation     0.5426 0.5106 0.29729
Proportion of Variance 0.0327 0.0290 0.00982
Cumulative Proportion  0.9612 0.9902 1.00000




> plot(fit) 

> biplot(fit) 

The first principal component (PC1) explains 65.5% of the total
variation, the second explains 8.62%, etc. The plot and biplot functions
take the output from prcomp and produce a scree plot and biplot
respectively, and their output can be seen in Figure 3.13. The scree plot
clearly suggests that only the first principal component has a major
contribution to the variance. From the biplot we see that the variables
(indicated by the nine arrows) are quite correlated since most of them
have virtually the same direction, and that they contribute more or less
the same to the first principal component. Only variable 9, V9, has any
real contribution to the second principal component.
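
The proportions reported by summary can be recomputed from the standard deviations stored in the fitted object. The sketch below repeats the fit so that it is self-contained; the object name prop_var is introduced here only for illustration.

```r
library(MASS)
data(biopsy)

predictors <- biopsy[complete.cases(biopsy), 2:10]
fit <- prcomp(predictors, scale.=TRUE)

# Each component's variance is its squared standard deviation
prop_var <- fit$sdev^2 / sum(fit$sdev^2)

round(prop_var, 4)           # proportion of variance per component
round(cumsum(prop_var), 4)   # cumulative proportion
```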

The loadings can be found as the rotation element of the
model fit object and the PCA scores (the original data in the rotated
coordinate system) are obtained either from the x component of the model
fit or by the predict function.

> fit

Standard deviations:
[1] 2.4288885 0.8808785 0.7343380 0.6779583 0.6166651 0.5494328
[7] 0.5425889 0.5106230 0.2972932

Rotation:
         PC1         PC2          PC3         PC4         PC5
V1 0.3020626  0.14080053 -0.866372452  0.10782844 -0.08032124
V2 0.3807930  0.04664031  0.019937801 -0.20425540  0.14565287
V3 0.3775825  0.08242247 -0.033510871 -0.17586560  0.10839155
V4 0.3327236  0.05209438  0.412647341  0.49317257  0.01956898
V5 0.3362340 -0.16440439  0.087742529 -0.42738358  0.63669325
V6 0.3350675  0.26126062 -0.000691478  0.49861767  0.12477294
V7 0.3457474  0.22807676  0.213071845  0.01304734 -0.22766572
V8 0.3355914 -0.03396582  0.134248356 -0.41711347 -0.69021015
V9 0.2302064 -0.90555729 -0.080492170  0.25898781 -0.10504168
           PC6          PC7         PC8          PC9
V1  0.24251752  0.008515668  0.24770729  0.002747438
V2  0.13903168  0.205434260 -0.43629981  0.733210938
V3  0.07452713  0.127209198 -0.58272674 -0.667480798
V4  0.65462877 -0.123830400  0.16343403 -0.046019211
V5 -0.06930891 -0.211018210  0.45866910 -0.066890623
V6 -0.60922054 -0.402790095 -0.12665288  0.076510293
V7 -0.29889733  0.700417365  0.38371888 -0.062241047
V8 -0.02151820 -0.459782742  0.07401187  0.022078692
V9 -0.14834515  0.132116994 -0.05353693 -0.007496101

Figure 3.13: Example of prcomp output. The scree plot in the left panel is
produced by plot on the result from the principal component analysis. The
right-hand graph is a biplot and shows the variables (as arrows) and observations
(numbers) plotted against the first two principal components.


> head(predict(fit))
        PC1        PC2         PC3         PC4         PC5
1 -1.469095 0.10419679 -0.56527102  0.03193593 -0.15088743
2  1.440990 0.56972390  0.23642767  0.47779958  1.64188188
3 -1.591311 0.07606412  0.04882192  0.09232038 -0.05969539
4  1.478728 0.52806481 -0.60260642 -1.40979365 -0.56032669
5 -1.343877 0.09065261  0.02997533  0.33803588 -0.10874960
6  5.010654 1.53379305  0.46067165 -0.29517264  0.39155544
          PC6        PC7        PC8          PC9
1  0.05997679  0.3491471  0.4200360 -0.005687222
2 -0.48268150 -1.1150819  0.3792992  0.023409926
3 -0.27916615  0.2325697  0.2096465  0.013361828
4  0.06298211 -0.2109599 -1.6059184  0.182642900
5  0.43105416  0.2596714  0.4463277 -0.038791241
6  0.11527442  0.3842529 -0.1489917 -0.042953075
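
The relation between the loadings and the scores can be verified directly: the scores are the centered and scaled data multiplied by the rotation matrix. The following sketch repeats the fit so that it can be run on its own; scores_manual is a name introduced only for this illustration.

```r
library(MASS)
data(biopsy)

predictors <- biopsy[complete.cases(biopsy), 2:10]
fit <- prcomp(predictors, scale.=TRUE)

# Standardize the data and rotate it into the principal component axes
scores_manual <- scale(predictors) %*% fit$rotation

all.equal(scores_manual, fit$x, check.attributes=FALSE)  # same scores
all.equal(predict(fit), fit$x)                           # predict returns x
```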


Input to prcomp can also be specified as a formula with no response
variable. In this case, missing values are handled directly by the
na.action argument.

> predictors <- biopsy[,2:9]

> fit <- prcomp(~ ., data=predictors, scale=TRUE)
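
A minimal sketch of the formula interface with an explicit na.action is shown below. It uses columns 2:10 so that all nine variables V1 to V9 are included (an assumption made here for illustration), and na.omit drops the incomplete rows before the analysis.

```r
library(MASS)
data(biopsy)

# Formula interface: "~ ." uses every column of the data frame
fit_na <- prcomp(~ ., data=biopsy[, 2:10], scale.=TRUE,
                 na.action=na.omit)

# Number of rows actually used equals the number of complete cases
n_complete <- sum(complete.cases(biopsy[, 2:10]))
nrow(fit_na$x)
n_complete
```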

See also: The princomp function can also be used for principal
component analysis but the calculations done by prcomp have better
numerical accuracy. See Rule 3.32 for an example of principal component
regression.


3.32 Fit a principal component regression model 

Problem: You want to use principal component regression analysis to 
investigate the linear relationship between a response variable and a set 
of explanatory principal components variables. 

Solution: Principal component analysis is often used to reduce data 
dimensionality and to overcome problems with collinearity among ob¬ 
served variables. 

Principal component regression uses the principal components iden¬ 
tified through principal component analysis as explanatory variables 
in a regression model instead of the original variables, and principal 
component regression is often seen as a natural "next step" to principal 
component analysis. Typically, only a subset of the principal compo¬ 
nents are used in the regression. 

In the following code we start with a principal component analysis of
the nine explanatory variables found in the biopsy dataset from the
MASS package. There are missing data present in the data frame so we
restrict our attention to the available complete cases. We then use the
predicted principal components as input to a logistic regression model
where the response is the breast tumor type: "benign" or "malignant".

> library(MASS) 

> data(biopsy) 

> names(biopsy) 

[1] "ID" "V1" "V2" "V3" "V4" "V5" "V6"

[8] "V7" "V8" "V9" "class" 

> predictors <- biopsy[complete.cases(biopsy),2:10] 

> fit <- prcomp(predictors, scale=TRUE) 

> summary(fit) 

Importance of components:
                         PC1    PC2    PC3    PC4    PC5    PC6
Standard deviation     2.429 0.8809 0.7343 0.6780 0.6167 0.5494
Proportion of Variance 0.655 0.0862 0.0599 0.0511 0.0423 0.0335
Cumulative Proportion  0.655 0.7417 0.8016 0.8527 0.8950 0.9285
                          PC7    PC8     PC9
Standard deviation     0.5426 0.5106 0.29729
Proportion of Variance 0.0327 0.0290 0.00982
Cumulative Proportion  0.9612 0.9902 1.00000

> pc <- predict(fit)[,1:4] # Select first 4 principal comp.

> head(pc)
        PC1        PC2         PC3         PC4
1 -1.469095 0.10419679 -0.56527102  0.03193593
2  1.440990 0.56972390  0.23642767  0.47779958
3 -1.591311 0.07606412  0.04882192  0.09232038
4  1.478728 0.52806481 -0.60260642 -1.40979365
5 -1.343877 0.09065261  0.02997533  0.33803588
6  5.010654 1.53379305  0.46067165 -0.29517264


> y <- biopsy$class[complete.cases(biopsy)] # Get response

> model <- glm(y ~ pc, family=binomial) # Fit the pcr

> summary(model)

Call:
glm(formula = y ~ pc, family = binomial)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-3.17913  -0.13036  -0.06188   0.02280   2.47993

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.0739     0.3035  -3.539 0.000402 ***
pcPC1         2.4140     0.2556   9.445  < 2e-16 ***
pcPC2         0.1592     0.5050   0.315 0.752540
pcPC3        -0.7191     0.3273  -2.197 0.028032 *
pcPC4         0.9151     0.3691   2.479 0.013159 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 884.35 on 682 degrees of freedom
Residual deviance: 106.12 on 678 degrees of freedom
AIC: 116.12

Number of Fisher Scoring iterations: 8


We see from the logistic regression analysis that the first principal com¬ 
ponent is highly significant while principal components 3 and 4 are 
barely significant. Principal component 2 is not significant when the 
three other principal components are part of the model, so even though 
principal component 2 explains second-most of the variation (of the 
variables) it has no effect on the responses. 
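
The role of the second principal component can also be examined with a likelihood ratio test that compares the models with and without it. This is a sketch under the assumption that the scores are stored as named columns in a data frame (pcdata, full and reduced are names introduced here), which also makes it easier to refit sub-models.

```r
library(MASS)
data(biopsy)

ok  <- complete.cases(biopsy)
fit <- prcomp(biopsy[ok, 2:10], scale.=TRUE)

# Response and the first four principal component scores in one data frame
pcdata <- data.frame(class=biopsy$class[ok], predict(fit)[, 1:4])

full    <- glm(class ~ PC1 + PC2 + PC3 + PC4, family=binomial, data=pcdata)
reduced <- glm(class ~ PC1 + PC3 + PC4, family=binomial, data=pcdata)

# Likelihood ratio test for dropping PC2 from the model
lrt <- anova(reduced, full, test="Chisq")
lrt
```

A large p-value in the second row of the table agrees with the Wald test shown in the summary output.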

See also: See Rule 3.31 for principal component analysis and Rule 3.8 for
logistic regression modelling. The pcr function from the pls package
can also be used for principal component regression.


3.33 Classify observations using linear discriminant
analysis

Problem: You want to find a linear combination of features which can 
be used for classification of two or more classes. 

Solution: Discriminant analysis is a statistical technique for
classification of data into mutually exclusive groups. In linear discriminant
analysis we assume that the groups can be separated by a linear
combination of features that describe the objects and with k groups we need
k - 1 discriminators to separate the classes.

The function lda from the MASS package can be used for linear
discriminant analysis. Input for the lda function is a model formula
of the form group ~ x1 + x2 + ... where the response group is
a grouping factor and x1, x2, ... are quantitative discriminators. The
prior option can be set to give the prior probabilities of class
membership. If it is unspecified, the probabilities of class membership are
estimated from the dataset.
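
As a brief illustration of the prior option, the sketch below fits the discriminant model with equal prior probabilities for the two classes instead of the observed class proportions. The choice of 0.5 for each class is an arbitrary assumption; the priors are given in the order of the factor levels.

```r
library(MASS)
data(biopsy)

levels(biopsy$class)         # order in which the priors must be given

# Equal priors for "benign" and "malignant"
fit_equal <- lda(class ~ V1 + V2 + V3, data=biopsy, prior=c(0.5, 0.5))

fit_equal$prior
table(biopsy$class, predict(fit_equal)$class)
```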

In the following code, we will make a model to classify breast cancer
type ("benign" or "malignant") based on tumor clump thickness (V1),
uniformity of cell size (V2) and uniformity of cell shape (V3). The
variables are found in the biopsy data frame found in the MASS package.

> library(MASS) 

> data(biopsy) 

> fit <- lda(class ~ V1 + V2 + V3, data=biopsy)

> fit
Call:
lda(class ~ V1 + V2 + V3, data = biopsy)

Prior probabilities of groups:
   benign malignant
0.6552217 0.3447783

Group means:
                V1       V2       V3
benign    2.956332 1.325328 1.443231
malignant 7.195021 6.572614 6.560166

Coefficients of linear discriminants:
         LD1
V1 0.2321486
V2 0.2574805
V3 0.2500765

> plot(fit, col="lightgray") 

The prior probability of the groups and the resulting linear
discriminator are both seen in the output. Plotting the fitted model can be seen
in Figure 3.14 and the type of plot produced depends on the number
of discriminators. If there is only one discriminator or if the argument
dimen=1 is set, then a histogram is plotted; if there are two or more
discriminators, then a pairs plot is shown.

Figure 3.14: Example of lda output. Histograms for values of the linear
discriminator are shown for observations from both the "benign" and "malignant"
groups.

The predict function provides a list of predicted classes (the class
component) and posterior probabilities (the posterior component)
when the result from lda is supplied as input. These can be used to
evaluate the sensitivity and specificity of the classification.

> result <- table(biopsy$class, predict(fit)$class) 

> result

            benign malignant
  benign       448        10
  malignant     33       208

> sum(diag(result)) / sum(result) 

[1] 0.9384835 
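
Sensitivity and specificity can be read off the same cross-table. The sketch below treats "malignant" as the positive class, which is an assumption made here for illustration; the dimension names truth and predicted are also introduced only for readability.

```r
library(MASS)
data(biopsy)

fit <- lda(class ~ V1 + V2 + V3, data=biopsy)
result <- table(truth=biopsy$class, predicted=predict(fit)$class)

# Proportion of malignant tumors that are classified as malignant
sensitivity <- result["malignant", "malignant"] / sum(result["malignant", ])

# Proportion of benign tumors that are classified as benign
specificity <- result["benign", "benign"] / sum(result["benign", ])

c(sensitivity=sensitivity, specificity=specificity)
```

With the counts reported above these correspond to 208/241 and 448/458, respectively.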

We can see here that the linear discriminant analysis correctly
classifies 93.8% of the observations. Note, however, that this result is based
on a linear discriminator that has been estimated from the same data
on which the classification is evaluated. In practice, we should evaluate the
classification on a sample that has not been used to estimate the
discriminators. In the following example, we use a third of the original
dataset as a training dataset and the remaining 2/3 as a test dataset.

> # Split the data up into random training and test samples 

> train <- sample(1:nrow(biopsy), ceiling(nrow(biopsy)/3))

> testdata <- biopsy[-train,]

> trainfit <- lda(class ~ V1 + V2 + V3, data=biopsy, subset=train)

> trainfit
Call:
lda(class ~ V1 + V2 + V3, data = biopsy, subset = train)

Prior probabilities of groups:
   benign malignant
0.6437768 0.3562232

Group means:
                V1       V2       V3
benign    3.066667 1.366667 1.500000
malignant 7.024096 6.819277 7.024096

Coefficients of linear discriminants:
         LD1
V1 0.1744050
V2 0.1810598
V3 0.3385484

> result2 <- table(testdata$class, 

+ predict(trainfit, testdata)$class) 

> result2 


benign malignant 
benign 302 6 
malignant 24 134 

> sum(diag(result2)) / sum(result2) 
[1] 0.9356223 


Evaluating the classifier on a separate test set hardly changed the
overall proportion of correctly classified observations (93.6% compared
to 93.8%). Alternatively, if the CV argument in the call to lda is set to
TRUE then results for classes and posterior probabilities are estimated
using leave-one-out cross-validation.
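
A minimal sketch of the CV=TRUE option is shown below. With this option lda returns a list that contains the leave-one-out predicted classes (class) and posterior probabilities (posterior) rather than a fitted model object; the names cvfit and cvtab are introduced here for illustration.

```r
library(MASS)
data(biopsy)

# Leave-one-out cross-validated predictions
cvfit <- lda(class ~ V1 + V2 + V3, data=biopsy, CV=TRUE)

cvtab <- table(biopsy$class, cvfit$class)
cvtab
sum(diag(cvtab)) / sum(cvtab)   # cross-validated proportion correct

head(cvfit$posterior)           # leave-one-out posterior probabilities
```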

Instead of using a model formula, lda can also take as input a data frame or
matrix, x, that contains the explanatory variables and a factor vector,
grouping, that specifies the response class for each observation.


See also: The qda function from the MASS package can perform quadratic
discriminant analysis. For more advanced discriminant methods check
out the mda package.


3.34 Use partial least squares regression for prediction

Problem: You want to use partial least squares regression to predict a 
response based on a large set of explanatory variables. 

Solution: Partial least squares (PLS) regression combines features from
principal component analysis and multiple regression and is
particularly useful when we want to predict a multivariate response from a
(potentially very) large set of explanatory variables.

PLS is commonly applied in situations where the number of explanatory
variables is so large compared to the number of observations that tradi¬ 
tional regression approaches are no longer feasible or if a large number 
of the explanatory variables are collinear. The partial least squares tech¬ 
nique has seen widespread use in chemometrics where it is used, for 
example, to study the combination of characteristics of a substance and 
the output of wavelengths after a large number of near infrared (NIR) 
wavelengths are passed through the substance. 

The PLS method extends principal component analysis in the sense that
PLS extracts orthogonal linear combinations of explanatory variables
(also known as "factors" or principal components) that explain the
variance in both the explanatory variables and the response variable(s).

The pls package implements partial least squares regression as well
as principal component regression. The plsr function computes the
principal PLS components and subsequently makes a regression of the
response on the principal components. plsr accepts a model formula
as its first argument.

There are four options that should be set for the plsr function. The
first is ncomp, which sets the maximum number of principal
components that are included in the subsequent regression model. The scale
option can be set to TRUE or to a numeric vector. In the first case,
each of the explanatory variables is divided by its standard
deviation. When scale is a numeric vector, each explanatory variable
is divided by the corresponding element of scale. The method and
validation options control the algorithm used to compute the
principal components and the type of model validation. Their default
values are the kernel algorithm ("kernelpls") and "none" (for no
cross-validation), respectively. Alternative computational algorithms that can
be chosen with the method argument are "widekernelpls" (for the
wide kernel algorithm), "simpls" (for the SIMPLS algorithm), and
"oscorespls" (for the classical orthogonal scores algorithm).
Cross-validation may help decide the number of components in the model;
cross-validation and leave-one-out cross-validation are performed
if validation="CV" or validation="LOO", respectively.

The gasoline data frame from the pls package contains information
on the octane number for 60 gasoline samples as well as corresponding
NIR spectra with 401 wavelengths. We use the first 50 samples to create
a PLS regression model and will evaluate the prediction error on the
remaining 10 samples. We use leave-one-out validation with the training
data to get estimates of the mean squared error for a different number
of components.

> library(pls)

> data(gasoline) 

> gasTrain <- gasoline[1:50,] # Training dataset with 50 samples 

> gasTest <- gasoline[51:60,] # Test dataset with 10 samples 

> gasl <- plsr (octane ~ NIR, ncomp=5, 

+ data=gasTrain, validation="LOO") 

> summary(gasl)

Data:   X dimension: 50 401
        Y dimension: 50 1
Fit method: kernelpls
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 50 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV           1.545    1.357   0.2966   0.2524   0.2476   0.2398
adjCV        1.545    1.356   0.2947   0.2521   0.2478   0.2388

TRAINING: % variance explained
        1 comps  2 comps  3 comps  4 comps  5 comps
X         78.17    85.58     93.4    96.06    96.94
octane    29.39    96.85     97.9    98.26    98.86



The summary output shows the percentage of the variance explained in
both the octane response variable and the explanatory variables based
on the training data. For both we have more than 95% of the
variation explained by the first 4 principal components. The summary
output also includes a "VALIDATION" section since we used leave-one-out
cross-validation. This section shows the estimated root mean squared
error of prediction together with its bias-corrected version (the adjCV
row), and the root mean squared error appears to be fairly stable around
0.25 after 3 principal components. Thus we decide on three components.

summary only provides a summary of the PLS result. To get the actual scores and
loading matrices from the fit we can use the scores and loadings
functions, respectively. They both require a fitted plsr object as input.
The predict function can extract the predictions from the training set
and it returns a three-dimensional array of predicted response values.
The dimensions correspond to the observations, the response variables
and the model sizes, respectively. We can also use plot to make a
prediction plot directly if we use the fitted object as first argument and
set the number of components to include as the ncomp option. The
result is seen in the left panel of Figure 3.15 which shows a high level
of prediction. plot has been modified for objects fitted with plsr, so
if we add the option plottype="loadings" then the different factor
loadings are plotted (see the right-hand plot of Figure 3.15, where we
have used comps=1:3 to get curves for the first three components).

Figure 3.15: Partial least squares prediction of octane from NIR spectra (left
plot), and factor loadings (right plot).

> plot(gasl, ncomp=3, line=TRUE) 

> plot(gasl, "loadings", comps=1:3)

We check for overfitting by predicting the values of the test dataset
based on the principal components computed from the training data.
The root mean squared error for the test data can be directly computed
with the RMSEP function when the fitted object is included as first
argument, and the new test data are included as argument newdata.

> RMSEP(gasl, newdata=gasTest) 

(Intercept) 1 comps 2 comps 3 comps 

1.5369 1.1696 0.2445 0.2341 

4 comps 5 comps 

0.3287 0.2780 
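
For reference, the quantity being reported is the square root of the average squared difference between observed and predicted values. The helper below is a hypothetical base R function written for illustration (it is not part of the pls package), and the octane numbers are made up.

```r
# Root mean squared error of prediction for observed vs. predicted values
rmsep <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

obs  <- c(88.1, 87.6, 88.4, 85.5)   # made-up observed octane numbers
pred <- c(88.0, 87.9, 88.2, 85.9)   # made-up predictions

rmsep(obs, pred)
```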

The root mean squared errors from the test data show that we have 
the same order of prediction error for three principal components as 
we saw from the cross-validation of the training data. Thus, there is 
nothing that suggests that we are overfitting the model when we use 
three principal components. 

See also: An example of principal component regression is shown in 
Rule 3.32, and cross-validation is also discussed in Rule 3.36. 


Resampling statistics and bootstrapping


3.35 Non-parametric bootstrap analysis 

Problem: You want to use non-parametric bootstrap to evaluate the 
properties and accuracy of a statistic. 

Solution: Bootstrap is a simple but computationally intensive method 
to assess statistical accuracy or to estimate the distribution of a statistic. 
The idea behind the bootstrap procedure is to repeatedly draw random 
samples from the original data with replacement and then use these 
"artificial" samples to compute the statistic of interest a large number 
of times thereby estimating the distribution of the statistic. 

The boot package provides extensive facilities for bootstrapping and 
related resampling methods. The boot function generates bootstrap 
samples and calculates the statistic and requires three arguments to be 
specified: data for the data frame, statistic for the function that 
calculates the statistic of interest, and the number of bootstrap repli¬ 
cates, R. The data argument can be either a vector, matrix or data frame 
and should comprise the original data. If the data is a matrix or data 
frame, then each row is considered one observation; i.e., we resample 
from the individual rows. The statistic function should be a
function which calculates and returns the statistic(s) of interest. The func¬
tion specified by the statistic option should take at least two ar¬ 
guments. The first object passed to the function is the original dataset 
given by the data option, and the second argument is a vector of in¬ 
dices which define the indices that constitute the non-parametric boot¬ 
strap sample. Finally, R is an integer that sets the number of bootstrap 
samples. Additional parameters to be passed to the function that pro¬ 
duces the statistic of interest can be added to the boot call. 
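
The calling convention can be illustrated with a simple statistic before turning to the genetics example. In this sketch the data are simulated, and trimmed_mean is a hypothetical statistic function with an extra argument, trim, which is passed on unchanged through the boot call.

```r
library(boot)

set.seed(42)                 # arbitrary seed for reproducibility
x <- rexp(50)                # simulated, skewed data

# First argument: the original data; second: indices of the bootstrap sample;
# any further arguments are supplied through the boot call
trimmed_mean <- function(data, indices, trim) {
  mean(data[indices], trim=trim)
}

b <- boot(x, statistic=trimmed_mean, R=999, trim=0.1)

b$t0                         # statistic computed on the original data
sd(b$t[, 1])                 # bootstrap estimate of the standard error
```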

We illustrate the non-parametric bootstrap procedure with an example 
from genetics. In a balanced half-sib animal experiment with n sires 
and r progeny per sire, we can calculate an estimate of the heritability 
for a quantitative trait as 

$$h^2 = \frac{4\,(MS_{sire} - MS_{within})}{(r+1)\,MS_{sire} - MS_{within}}$$

where $MS_{sire}$ and $MS_{within}$ are the mean square values for the among
sires and within sires variance sources obtained from an analysis of
variance table. We can use the bootstrap analysis to provide confidence
limits for the heritability estimate.

We simulate artificial data by first generating the mean level for each
sib-ship from a normal distribution and then subsequently generating
measurements for the individual half-sibs. Each row in the mydata
matrix below corresponds to measurements from five half-sibs and there
are 30 rows, one for each sire.


> library(boot) 

> set.seed(1) # Set seed to reproduce simulations 

> mydata <- rnorm(150, mean=rep(rnorm(30), 5))

> mydata <- matrix(mydata,ncol=5) 

> head(mydata) 

[,1] [,2] [,3] [,4] [,5]

[1,] 0.7322257 1.7751639 -1.1689738 -1.1324113 -0.1762667 

[2,] 0.0808556 0.1444033 1.3915111 1.5266821 0.1650835 

[3,] -0.4479570 -0.1458892 0.3247740 -1.0502080 -1.1536970 

[4,] 1.5414758 1.6232830 2.2954945 1.4157243 0.6659187 

[5,] -1.0475518 -0.4137654 1.9163412 0.2293170 -1.1579525 

[6,] -1.2354629 -0.6316761 -0.2619820 -0.1078021 -1.8956607 

The following function calculates the heritability based on the balanced 
half-sib design. The function should take two arguments: the first 
should be the data frame or matrix, and the second should be a vec¬ 
tor of indices or weights that define the current bootstrap sample. We 
set a default value for the second argument such that the full original 
dataset is used if the desired subsample is not specified. The stack 
function is used in combination with the matrix transpose function, t, 
to transform the input data matrix to the form needed for lm. 

> heritability <- function(data, indices=1:nrow(data)) {

+ r <- ncol(data) 

+ df <- stack(as.data.frame(t(data[indices,]))) 

+ result <- anova(lm(values ~ ind, data=df)) 

+ mssire <- result[1,3] 

+ mswithin <- result[2,3] 

+ heritab <- 4*(mssire - mswithin) / ((r+1)*mssire - mswithin) 

+ return(heritab) 

+ } 

> heritability(mydata) 

[1] 0.5533472 

The heritability of the original data frame is 0.5533. 

The newly defined function can be used with the boot function for 
computing non-parametric bootstrap estimates of the standard error. 
If plot is applied to the output from the bootstrap procedure, then a 
histogram and a q-q plot of the estimated statistics are plotted. 

> result <- boot(mydata, heritability, R=10000)

> result

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mydata, statistic = heritability, R = 10000)

Bootstrap Statistics :
     original       bias    std. error
t1* 0.5533472  -0.03415504    0.094238

> plot(result)

Figure 3.16: Bootstrap estimates of heritability for half-sib design.

The estimated heritability for the original data is shown in the bootstrap
output as well as the bias and the estimated standard error. The bias,
the difference between the mean of the bootstrap sample statistics and
the original heritability estimate, is -0.0342, which is small compared
to the estimate itself. The standard error is 0.094238 so
in principle we could construct a 95% confidence interval for the
heritability by using this value. However, Figure 3.16 suggests that the
bootstrap heritability estimates are far from symmetric and Gaussian,
so the 95% confidence interval will not be well approximated by the
estimate plus or minus twice the standard error.

Instead, the boot.ci function (also found in the boot package) can be
used to compute a bootstrap percentile confidence interval and a
bias-adjusted bootstrap percentile confidence interval. boot.ci needs the
output from a boot call as input.

> boot.ci(result)

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

CALL :
boot.ci(boot.out = result)

Intervals :
Level      Normal              Basic
95%   ( 0.4028,  0.7722 )   ( 0.4986,  0.8503 )

Level     Percentile            BCa
95%   ( 0.2564,  0.6081 )   ( 0.3851,  0.6181 )
Calculations and Intervals on Original Scale

The Normal interval corresponds to the interval based on twice the 
standard error (after correction for bias) and the Percentile confi¬ 
dence interval is based on the empirical "middle" 95% of the bootstrap 
heritabilities and it is clearly asymmetric. The BCa interval is the bias- 
adjusted percentile interval. Here we conclude that the 95% bootstrap 
confidence interval for the heritability would be (0.3851; 0.6181). 
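
The interval types can also be requested individually through the type argument of boot.ci, and the limits can be extracted from the returned object. The sketch below uses simulated data and the mean as statistic (both assumptions made for illustration) and compares the percentile interval with the raw quantiles of the bootstrap replicates.

```r
library(boot)

set.seed(42)                 # arbitrary seed for reproducibility
x <- rexp(50)                # simulated, skewed data

b <- boot(x, function(data, indices) mean(data[indices]), R=1999)

# Request only the normal, percentile and BCa intervals
ci <- boot.ci(b, type=c("norm", "perc", "bca"))

ci$percent[4:5]              # lower and upper percentile limits
ci$bca[4:5]                  # lower and upper BCa limits

# The percentile limits are close to the empirical quantiles
quantile(b$t, c(0.025, 0.975))
```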

See also: The above example describes non-parametric bootstrap. Para¬ 
metric bootstrap or permutation analysis are handled by boot with the 
sim="parametric" or sim="permutation" options, respectively. 


3.36 Use cross-validation to estimate the performance of 
a model or algorithm 

Problem: You want to estimate the performance of a model or
algorithm/learning method using cross-validation.

Solution: Cross-validation is a useful technique for evaluating the 
performance and stability of a model or algorithm. Cross-validation 
can also be used to compare the performance of several different algo¬ 
rithms, or to choose among various models. 

In k-fold cross-validation the data is partitioned into k subsets of
approximately equal size. Subsequently, each of the k subsets is left
out in turn while the remaining k - 1 subsets are used to "train" the
model (i.e., fit the model or run the algorithm). The omitted subset
is used to compute the error rate for the resulting model. If k equals
the sample size, then k-fold cross-validation is called "leave-one-out"
cross-validation. The overall performance is estimated as the average
error rate for the k results.
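
The procedure just described can be sketched directly in base R. This
is only an illustration of the idea and not the implementation used by
any package: the choice of k = 3, the random fold assignment, and the
linear model for the built-in trees data are assumptions made for the
example.

```r
data(trees)
set.seed(1)                                 # Fold assignment is random
k <- 3

# Assign each row to one of k folds of roughly equal size
fold <- sample(rep(1:k, length.out=nrow(trees)))

err <- numeric(k)
for (i in 1:k) {
  train <- trees[fold != i, ]               # k - 1 folds used for fitting
  test  <- trees[fold == i, ]               # The omitted fold
  fit   <- lm(Volume ~ Girth + Height, data=train)
  # Average squared prediction error on the omitted fold
  err[i] <- mean((test$Volume - predict(fit, newdata=test))^2)
}
mean(err)                                   # Average error over the k folds
```

Because the folds are drawn at random, a different seed gives a slightly
different estimate unless k equals the sample size.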

K-fold cross-validation prediction error for generalized linear models
can be computed using the cv.glm function from the boot package.
cv.glm requires a matrix or data frame containing the full data (where
each row corresponds to a case) as first argument, and an object from
a glm model fit as second argument. The options K and cost set the
number of groups and the function used to evaluate the fit of the model.
By default, the cost function is the average squared error function.
cv.glm returns a list with components K, which is the number of groups
used, and delta, which is a vector of length 2, where the first element is
the raw cross-validation estimate of the prediction error, and the second
element is an adjusted cross-validation estimate of the prediction error.
If the number of observations in the training set is relatively small, then
bias may be introduced and we will overestimate the true prediction
error unless it is adjusted.
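
As a small sketch of the cost option and of the components returned by
cv.glm, the code below replaces the default squared error with the mean
absolute prediction error. The function name abscost is hypothetical and
introduced only for this illustration.

```r
library(boot)                     # For the cv.glm function
data(trees)
fit <- glm(Volume ~ Girth + Height, data=trees)

# Hypothetical alternative cost: mean absolute prediction error.
# The first argument holds observed responses, the second predictions.
abscost <- function(y, yhat) mean(abs(y - yhat))

cv.abs <- cv.glm(trees, fit, cost=abscost)   # Leave-one-out by default
cv.abs$K                          # Number of groups used
cv.abs$delta                      # Raw and adjusted estimates
```

With the default K the number of groups equals the number of rows, so
the result here is a leave-one-out estimate on the absolute error scale.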

In the example below, we look at the prediction error of the trees data.

> library(boot)   # For the cv.glm function
> data(trees)
> fit1 <- glm(Volume ~ Girth + Height, data=trees)
> cv.err.loo <- cv.glm(trees, fit1)
> cv.err.loo$delta
       1        1
18.15783 18.07781
> cv.err.3 <- cv.glm(trees, fit1, K=3)   # 3-fold cross-validation
> cv.err.3$delta
       1        1
18.24974 17.30670

Thus we conclude that the average squared prediction error for the
leave-one-out analysis is around 18.158, and we get almost the same
result when using 3-fold cross-validation. The bias-adjusted
cross-validation errors are almost the same for both situations, but we
notice that the difference between the unadjusted and adjusted prediction
errors is larger for 3-fold cross-validation than for leave-one-out
cross-validation.

If we wish to compare the prediction error between different models,
then we can use cross-validation to make sure we are not over-fitting
one of the models. Since we are looking at log-transformed variables
below we need to define our own cost function such that prediction
is measured on the same scale as before. The cost function should take
two arguments: the responses used for the test dataset and the expected
values computed from the training data.

> fit2 <- glm(log(Volume) ~ log(Girth) + log(Height), data=trees)
> mycost <- function(responses, expected) {
+   mean((exp(responses) - exp(expected))^2) }
> cv2.err.loo <- cv.glm(trees, fit2, mycost)
> cv2.err.loo$delta
       1        1
6.984000 6.962766

The leave-one-out prediction error of 6.984 for the model with
transformed variables shows a substantial improvement in prediction error
compared to the model without transformed variables.

The generic crossval function from the bootstrap package can be
used for general k-fold cross-validation. crossval needs a minimum
of four arguments: a matrix of explanatory variables, x, where each
row corresponds to a case, a vector of response values, y, and
functions theta.fit and theta.predict which are the function to be
cross-validated and the function to compute the predicted values,
respectively.

Note that crossval converts the x object to a matrix before it is passed
to theta.fit and theta.predict. Thus, if crossval is used for
data frames that contain non-numeric values then the two functions
need to convert the variables in the input object to the proper formats
before fitting and prediction.

Below we wish to use cross-validation to check for over-fitting when a
single-layer neural network is used to predict exercise status from
gender, age, and smoking status for the survey dataset. We use the nnet
function from the nnet package to fit the neural network, and we use
a "trick" here to circumvent the fact that crossval passes the x value
as a matrix since we have both numeric and categorical explanatory
variables in our data. Instead we create a vector, index, that is used
to identify the observations in each group and use the index vector
inside theta.fit and theta.predict to extract the correct subsets.

> library(bootstrap)   # For the crossval function
> library(nnet)        # For the nnet function
> library(MASS)        # For the survey data frame
> data(survey)
> fit <- nnet(Exer ~ Sex + Smoke + Age, data=survey,
+             size=10, rang=.05, maxiter=200, trace=FALSE)
> fit
a 5-10-3 network with 93 weights
inputs: SexMale SmokeNever SmokeOccas SmokeRegul Age
output(s): Exer
options were - softmax modelling
> # Calculate the number of misclassifications
> cc <- complete.cases(survey$Exer, survey$Sex,
+                      survey$Smoke, survey$Age)
> sum(predict(fit, type="class") != survey$Exer[cc])
[1] 109
> index <- 1:nrow(survey)   # Work-around to identify obs.
> # theta.fit fits the model. It uses the indices passed through
> # x to select the subset of the survey data frame. The response
> # y is not really used since we get the correct response with
> # the subset argument to nnet
> theta.fit <- function(x,y) {
+   nnet(Exer ~ Sex + Smoke + Age, data=survey, size=10,
+        subset=x, rang=.05, maxiter=200, trace=FALSE) }
> # theta.predict returns the predicted class. Again, indices for
> # the test data are passed through x and selected from survey
> theta.predict <- function(fit, x) {
+   predict(fit, survey[x,], type="class") }
> o <- crossval(index, survey$Exer, theta.fit, theta.predict,
+               ngroup=5)

> table(survey$Exer, o$cv.fit) 



       Freq None Some
  Freq   68    5   41
  None   15    0    8
  Some   59    2   37

> sum(o$cv.fit != survey$Exer, na.rm=TRUE) 

[1] 130 

There were 109 misclassifications for the neural network fit of the full
data. Cross-validation showed a larger number of misclassifications,
130, which suggests that we are over-fitting the data with the
full dataset.


3.37 Calculate power or sample size for simple designs 

Problem: You are in the planning phase of an experiment and want 
to estimate the number of observations you need to obtain a certain 
statistical power. 

Solution: Sample size and power calculations are typically used to 
show that under certain conditions, the hypothesis test has a desired 
chance of detecting an effect if it indeed exists. 

Educated sample size and power calculations require that researchers
have prior knowledge about the variables they are measuring and the
statistical analysis they intend to use, and often the calculations rely on
a combination of crude estimates and subjective guesses. While it is
possible to calculate the necessary sample size (to achieve a certain
statistical power) or the statistical power (given a certain sample size),
it is unfeasible in many real situations simply because all but the
simplest experimental designs require so many guesses and assumptions
that are difficult to validate.

R provides functions for sample size calculations for simple designs
(one- and two-sample, and paired t tests), one-way analysis of variance,
as well as two-sample tests for proportions through the power.t.test,
power.anova.test, and power.prop.test functions, respectively.

The power.t.test function computes power for one- and two-sample
t tests. It takes five main arguments: the sample size, n, delta which is
the true difference in mean values, power for the statistical power, sd
for the standard deviation, and sig.level for the significance level.
If exactly four of the five arguments are specified, then power.t.test
will calculate the final parameter. Default values for sig.level and
sd are 0.05 and 1, respectively. power.t.test assumes a two-sample
t test with equal variances and group sizes by default. Setting the type
option to "one.sample" or "paired" produces calculations based
on one-sample and paired tests.
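
The same function can solve for any one of the quantities that is left
unspecified. As a minimal sketch, the call below leaves out delta to
find the smallest mean difference detectable with 80% power; the group
size of 30 and the standard deviation of 3.6 are assumed values chosen
only for illustration.

```r
# Leave delta unspecified to solve for the detectable difference
res <- power.t.test(n=30, sd=3.6, power=0.8, sig.level=0.05)
res$delta

# Plugging the result back in recovers the requested power
check <- power.t.test(n=30, delta=res$delta, sd=3.6, sig.level=0.05)
check$power
```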

To calculate the sample size needed in order to use a two-sample t test 
to detect a mean difference of 2 when the standard deviation within 
each group is 3.6 with a power of 80% and at a significance level of 5% 
we type the following. 

> power.t.test(delta=2, sd=3.6, power=.8, sig.level=0.05) 

Two-sample t test power calculation 

n = 51.83884 
delta = 2 
sd = 3.6 

sig.level = 0.05 
power = 0.8 

alternative = two.sided 
NOTE: n is number in *each* group 

Thus we need a total of 104 individuals (52 in each of the two groups) to 
obtain the desired power given our prior knowledge of the true mean 
difference and the standard deviation. 
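
A calculation like this can be checked by simulation. The sketch below
is an illustration under the same assumptions as the calculation above
(normally distributed data, equal variances, 52 observations per group);
the number of simulated experiments, nsim, is an arbitrary choice.

```r
set.seed(1)
nsim <- 2000

# Simulate two groups of 52 with a true mean difference of 2 and
# standard deviation 3.6, and record whether the two-sample t test
# rejects the null hypothesis at the 5% level
reject <- replicate(nsim, {
  g1 <- rnorm(52, mean=0, sd=3.6)
  g2 <- rnorm(52, mean=2, sd=3.6)
  t.test(g1, g2, var.equal=TRUE)$p.value < 0.05
})
mean(reject)     # Proportion of rejections; should be close to 0.80
```

The same approach extends to designs for which no closed-form power
function is available, at the cost of simulation error.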

Sometimes, sample size calculations are based on a prior estimate of
the effect size, i.e., the ratio of the true mean difference to the standard
deviation,

$$ES = \frac{\mu_1 - \mu_2}{\sigma} = \frac{\delta}{\sigma}$$

where $\delta$ is the difference in means and $\sigma$ is the standard
deviation.
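
A short sketch shows that only the effect size matters for the power of
the t test: two combinations of mean difference and standard deviation
with the same ratio give the same power. The numbers are assumptions
picked for illustration.

```r
# Same effect size 0.5, expressed on two different scales
pow.a <- power.t.test(n=30, delta=0.5, sd=1)$power
pow.b <- power.t.test(n=30, delta=2,   sd=4)$power
c(pow.a, pow.b)        # The two powers agree
```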

If we only know the effect size then one of the parameters, say the
standard deviation, can be fixed, and we just change the mean difference to
obtain the desired effect size. For example, to compute the power of a
one-sample t test with 20 individuals where we are looking for an effect
size of 0.70 we can use this code.

> power.t.test(n=20, delta=.70, sd=1, sig.level=0.05,
+              type="one.sample")

     One-sample t test power calculation

              n = 20
          delta = 0.7
             sd = 1
      sig.level = 0.05
          power = 0.8435055
    alternative = two.sided

The statistical power of the experiment is 84%.

The power.prop.test function computes the power or sample size
for two-sample comparisons of proportions. power.prop.test has
five main arguments: the sample size in each group, n, the probabilities
of success in the two groups, p1 and p2, as well as the significance level,
sig.level, and the power, power. Exactly one of the five parameters
should be passed as NULL, and that parameter will be calculated based
on the other four.

The following code computes the necessary sample size to obtain a
power of 80% when testing equality of two proportions, where the true
probabilities of success are 43% and 75% at a 5% significance level.

> power.prop.test(p1=0.43, p2=0.75, power=.8, sig.level=0.05)

     Two-sample comparison of proportions power calculation

              n = 35.88090
             p1 = 0.43
             p2 = 0.75
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group


Consequently, 36 individuals are needed in each of the two groups.

Finally, the power.anova.test function can be used to compute power for
balanced one-way analysis of variance designs. Like the two previous
functions, power.anova.test needs input on all but one of the
six parameters: groups (the number of groups or categories), n (the
number of observations per group), the between group variance (option
between.var), the variance within groups (within.var), the
significance level (sig.level), and power. Note that it is the between
and within group variances and not standard deviations that should be
specified!

For example, if we have prior information on the group means and the
variance within groups then we can calculate the necessary sample size
as follows:

Figure 3.17: Power plots for a t test as a function of sample size (left panel) and
effect size (right panel). Three different effect sizes are shown in the left plot: 0.7
(solid line), 0.8 (dashed line) and 1.0 (dotted line), while three different group
sample sizes are shown in the right plot: 25 (solid line), 40 (dashed line), and 60
(dotted line).


> groupmeans <- c(4.5, 8, 6.2, 7.0) 

> power.anova.test(groups = length(groupmeans), 

+ between.var=var(groupmeans) , 

+ within.var=5.8, power=.80, sig.level=0.05) 

Balanced one-way analysis of variance power calculation 

groups = 4 

n = 10.65777 
between.var = 2.189167 
within.var = 5.8 
sig.level = 0.05 
power = 0.8 

NOTE: n is number in each group 


So we would need a total of 44 individuals (11 in each of the four 
groups) to obtain the desired power. 

Power considerations are often used for research proposals and grant
applications and in those situations it is sometimes interesting to show
how sensitive the power is to the choice of sample or effect size. The
code below produces the two plots shown in Figure 3.17 where the
power from a t test is plotted against the sample size (left plot) for
different effect sizes and against the effect size for fixed group sample
sizes (right plot).

> n <- seq(20, 50)   # Consider sample sizes from 20 to 50
> plot(n, power.t.test(n, delta=.7)$power, type="l",
+      ylim=c(0.55,1), ylab="Power",
+      xlab="Number of observations per group")
> lines(n, power.t.test(n, delta=.8)$power, lty=2)
> lines(n, power.t.test(n, delta=1)$power, lty=3)
> delta <- seq(0, 1.6, .1)   # Consider effect sizes from 0 to 1.6
> plot(delta, power.t.test(n=25, delta=delta)$power,
+      type="l", ylim=c(0,1), xlab="Effect size",
+      ylab="Power")
> lines(delta, power.t.test(n=40, delta=delta)$power, lty=2)
> lines(delta, power.t.test(n=60, delta=delta)$power, lty=3)


See also: The pwr package provides additional functions for power 
calculations for specific designs and tests. 


Robust statistics 


3.38 Correct p-values for multiple testing 

Problem: You have made several simultaneous statistical tests and 
want to correct the p-values for multiple testing. 

Solution: The multiple testing problem refers to the situation where 
several tests or statistical inferences are considered simultaneously as a 
group. Each test has a false positive rate (the type I error rate) which 
is equal to the threshold set for statistical significance, generally 5%. 
However, when more than one test is done then the overall type I error 
rate (i.e., the probability that at least one of the test results is a false 
positive) is much greater than 5%. 
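
The size of the problem is easy to quantify in the simplest case. The
sketch below assumes that the tests are independent and that every null
hypothesis is true; both are simplifying assumptions and rarely hold
exactly in practice.

```r
m <- c(1, 5, 10, 20, 100)       # Number of independent tests
fwer <- 1 - (1 - 0.05)^m        # P(at least one false positive)
round(fwer, 3)
```

Already with 10 tests the chance of at least one false positive is
about 40%, and it approaches 1 as the number of tests grows.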

Multiple testing corrections adjust p-values derived from multiple
statistical tests in order to control the overall type I error rate and reduce
the occurrence of false positives.

The p.adjust function takes a vector of p-values as input and returns
a vector of p-values adjusted for multiple comparisons. The method
argument sets the correction method, and the values "holm" for Holm's
correction (the default), "bonferroni" for Bonferroni correction, and
"fdr" (for false discovery rate) are among the possibilities. The Holm
and Bonferroni methods both seek to control the overall false positive
rate with Holm's method being less conservative than the Bonferroni
method. The fdr method controls the false discovery rate which is a
less stringent criterion than the overall false positive rate.
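
To show what the three corrections do, the following sketch reproduces
them by hand for a small vector of p-values. The five p-values are
hypothetical and are assumed to be sorted in increasing order, which
keeps the hand calculations short.

```r
p <- c(0.01, 0.02, 0.03, 0.04, 0.05)   # Hypothetical sorted p-values
n <- length(p)
i <- seq_len(n)

bonf <- pmin(1, p * n)                         # Multiply by number of tests
holm <- pmin(1, cummax(p * (n - i + 1)))       # Step-down multipliers
fdr  <- pmin(1, rev(cummin(rev(p * n / i))))   # Benjamini-Hochberg

rbind(bonf, holm, fdr)

# The hand calculations agree with p.adjust
all.equal(bonf, p.adjust(p, method="bonferroni"))
all.equal(holm, p.adjust(p, method="holm"))
all.equal(fdr,  p.adjust(p, method="fdr"))
```

The Holm values are never larger than the Bonferroni values, and the
fdr values are never larger than the Holm values, which reflects the
ordering of the three criteria by stringency.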

We illustrate the p.adjust function with a typical microarray gene
expression experiment found as the superroot2 data frame in the
kulife package. The expression levels of 21,500 different genes are
examined and for each gene we wish to determine if there is any
difference in gene expression between two plant types (mutant and wild
type) after correcting for dye color and array. A standard analysis of
variance model is used to test the hypothesis of no difference in gene
expressions between the two types of plants for each of the 21,500 genes.
The sapply function is combined with anova and lm to calculate the
21,500 p-values. We would expect 1075 genes to be deemed "significant"
by chance with a standard p-value cut-off of 0.05.


> library(kulife) 

> data(superroot2) 

> geneid <- unique(superroot2$gene) # Get vector of gene names 

> # Analyze each gene separately and extract the p-value 

> pval <- sapply(1:21500, function(i) { 

+ anova(lm(log(signal) ~ array + color + plant, 

+ data=superroot2, 

+ subset=(gene==geneid[i]))) [3,5] }) 

> head(sort(pval), 8) # Show smallest p-values 

[1] 1.048266e-05 1.781166e-05 2.880092e-05 3.298942e-05 

[5] 3.410671e-05 3.438769e-05 3.503237e-05 4.076954e-05 

> head(p.adjust(sort(pval)), 8) # Holm correction 

[1] 0.2253667 0.3829150 0.6191334 0.7091405 0.7331238 0.7391290 
[7] 0.7529507 0.8762190 

> head(p.adjust(sort(pval), method="fdr"), 8) # FDR 

[1] 0.1075944 0.1075944 0.1075944 0.1075944 0.1075944 0.1075944 
[7] 0.1075944 0.1095631 

The results from p.adjust show that the smallest p-value is 0.2254
after correction with the Holm method, which suggests that none of
the 21,500 genes appears to be significantly differentially expressed when
analyzed using analysis of variance. If we control the false discovery
rate, then we get slightly smaller p-values than for Holm's method but
there are still no significant gene expressions after correcting for the
number of tests.

Note that implicit in all multiple testing procedures for adjusting
p-values is the assumption that the distribution of the p-values is correct.
If this assumption is violated, then robust methods such as, for example,
resampling methods should be used to compute the p-values before
adjusting for the number of tests.

See also: The multcomp package allows for simultaneous multiple
comparisons of parameters whose estimates are generally correlated.


Non-parametric methods


3.39 Use Wilcoxon's signed rank test to test a sample
median

Problem: You want to use a non-parametric method to test if a single
sample is symmetric around a median value, μ.

Solution: Wilcoxon's signed rank sum test is a non-parametric test that 
is used to test the median of a sample of ordinal or quantitative data. 
The test can also be used to compare two-sample paired data by taking 
differences of the paired observations and testing if the median of the 
differences equals zero. 

The function wilcox.test computes Wilcoxon's signed rank test
statistic and the corresponding p-value, and the function handles
both one-sample and paired data. The one-sample situation occurs
by default if only a single vector is given as input and the paired
situation if two vectors are given and the option paired=TRUE is set.
For both the one-sample and the paired data a Wilcoxon signed rank
test of the null hypothesis that the distribution of the data is symmetric
around a median value μ is performed, and the value of μ is set by the
option mu which defaults to zero.

> x <- rbinom(50, size=1, p=.6)   # Generate random data
> x
 [1] 1 0 0 1 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 0 0 0
[31] 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0
> wilcox.test(x, mu=0.5)   # Test median equal to 0.5

        Wilcoxon signed rank test with continuity correction

data:  x
V = 790.5, p-value = 0.09074
alternative hypothesis: true location is not equal to 0.5

> x2 <- x - 0.5     # Deduct 0.5 from all observations
> wilcox.test(x2)   # Make the same test with median 0

        Wilcoxon signed rank test with continuity correction

data:  x2
V = 790.5, p-value = 0.09074
alternative hypothesis: true location is not equal to 0

> sample1 <- rbinom(12, size=4, p=0.5)
> sample1
 [1] 1 4 1 3 4 2 1 4 2 2 2 3
> sample2 <- sample1 + rbinom(12, size=2, p=0.5) - 1
> sample2
 [1] 1 5 1 2 4 2 1 4 2 2 2 4
> wilcox.test(sample1, sample2, paired=TRUE)

        Wilcoxon signed rank test with continuity correction

data:  sample1 and sample2
V = 2, p-value = 0.7728
alternative hypothesis: true location shift is not equal to 0

Warning messages:
1: In wilcox.test.default(sample1, sample2, paired = TRUE) :
  cannot compute exact p-value with ties
2: In wilcox.test.default(sample1, sample2, paired = TRUE) :
  cannot compute exact p-value with zeroes


In the first example above, we get a test statistic of 790.5 with a
corresponding p-value of 0.0907 when we test if the median of the sample x
is 0.5. Hence, we fail to reject the null hypothesis that the median could
be 0.5. We get the same result if we first subtract the desired value of μ
and then use wilcox.test to test for a median value of zero.

wilcox.test uses a normal approximation to calculate p-values unless
the sample contains fewer than 50 values or the option exact=TRUE
is specified. Also, wilcox.test cannot compute exact tests in the
presence of ties. However, the coin package provides the function
wilcoxsign_test which handles exact tests even when tied values
are present. The argument distribution=exact() should be used
with wilcoxsign_test to ensure that exact test probabilities are
computed.

For both the one-sample and the paired case the model formula used
with wilcoxsign_test should be specified as x ~ y. Here, x and y
are the two numeric vectors (for the paired case) or the numeric vector
x and a vector of values for y with the same length as x (for the
one-sample case).

> library(coin)
> mu <- rep(.5, length(x))
> wilcoxsign_test(x ~ mu, distribution=exact())

        Exact Wilcoxon-Signed-Rank Test (zeros handled a la Pratt)

data:  y by x (neg, pos)
         stratified by block
Z = 1.6971, p-value = 0.1189
alternative hypothesis: true mu is not equal to 0

> wilcoxsign_test(sample1 ~ sample2, distribution=exact())

        Exact Wilcoxon-Signed-Rank Test (zeros handled a la Pratt)

data:  y by x (neg, pos)
         stratified by block
Z = -0.5774, p-value = 1
alternative hypothesis: true mu is not equal to 0

For the one-sample case we first generate a vector of values under the
null hypothesis and obtain an exact p-value of 0.1189. Hence we fail
to reject the null hypothesis that μ = 0.5. In the paired case we get
a p-value of 1 and the exact test provides the same conclusion as the
asymptotic test seen above.
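
The statistic V reported by wilcox.test is the sum of the ranks of the
positive deviations from the hypothesized median. The sketch below
reproduces it by hand; the six data values are hypothetical and were
chosen without ties or zeros so that no special handling is needed.

```r
x <- c(1.2, -0.4, 2.3, 0.9, -1.1, 3.0)   # Hypothetical data without ties
mu <- 0
d <- x - mu                       # Deviations from the hypothesized median
r <- rank(abs(d))                 # Rank the absolute deviations
V <- sum(r[d > 0])                # Sum the ranks of positive deviations
V
wilcox.test(x, mu=mu)$statistic   # Same value
```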

See also: See Rule 3.40 for a non-parametric rank test for two
independent samples and Rule 3.41 for a non-parametric rank test for
several independent samples.


3.40 Use Mann-Whitney's test to compare two groups 

Problem: You want to use a non-parametric test to compare if two 
independent groups of sampled data have equally large values. 

Solution: The Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon
or Wilcoxon rank-sum test) is a non-parametric test that is
used to compare two independent groups of ordinal or quantitative
sample data. The test can be employed when the normality assumption
of a two-sample t-test is not met.

The function wilcox.test calculates the Mann-Whitney U test statistic
and tests the hypothesis that the observations from one group are not
larger than observations from the other group. Input to wilcox.test
can either be a formula or two vectors representing the values from the
two groups.

In the example below we use the ToothGrowth data to examine if the
supplement type (orange juice or ascorbic acid) of vitamin C influences
the length of teeth in guinea pigs.

> data(ToothGrowth)
> wilcox.test(len ~ supp, data=ToothGrowth)

        Wilcoxon rank sum test with continuity correction

data:  len by supp
W = 575.5, p-value = 0.06449
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(x = c(15.2, 21.5, 17.6, 9.7, 14.5, 10,  :
  cannot compute exact p-value with ties

We obtain a test statistic of 575.5 which yields a p-value of 0.06449. Thus 
we (barely) fail to reject the null hypothesis that the two groups have 
equally large values. 
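
The statistic W reported by wilcox.test for two independent groups is
the rank sum of the first group minus its smallest possible value. The
sketch below verifies this by hand; the two small vectors are
hypothetical and contain no ties.

```r
a <- c(1.1, 2.5, 3.7, 4.2)       # Hypothetical group 1
b <- c(0.5, 2.0, 3.0)            # Hypothetical group 2
r <- rank(c(a, b))               # Joint ranks of all observations
n1 <- length(a)

# Rank sum of group 1 minus its minimum possible value n1(n1+1)/2
W <- sum(r[1:n1]) - n1 * (n1 + 1) / 2
W
wilcox.test(a, b)$statistic      # Same value
```

Equivalently, W counts the number of pairs in which the observation
from the first group exceeds the observation from the second group.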

wilcox.test uses a normal approximation to calculate p-values unless
the samples contain fewer than 50 values or the option exact=TRUE
is given. Also, wilcox.test cannot compute exact p-values when ties
are present in the data. However, the coin package provides the
function wilcox_test which handles exact tests even when tied values are
present. The option distribution=exact() should be used with
wilcox_test to ensure that exact test probabilities are computed.

> library(coin) 

> wilcox_test(len ~ supp, data=ToothGrowth, distribution=exact()) 

Exact Wilcoxon Mann-Whitney Rank Sum Test 

data: len by supp (OJ, VC) 

Z = 1.8562, p-value = 0.06366 

alternative hypothesis: true mu is not equal to 0 

The conclusion from the exact test is virtually identical to the result 
from the asymptotic test shown above. We get a p-value of 0.06366 
and fail to reject the hypothesis that the two groups have equally large 
values. 

See also: Use Wilcoxon's signed rank sum test (Rule 3.39) to test if paired 
two-sample data are identical. Rule 3.41 extends the Mann-Whitney U 
test to more than two groups. 


3.41 Compare groups using Kruskal-Wallis' test 

Problem: You want to compare three or more independent groups of 
sampled data using a non-parametric test. 

Solution: The Kruskal-Wallis test is a non-parametric test that can be
used to compare two or more independent groups of ordinal or quantitative
sample data when the normality assumption of one-way analysis
of variance is not met.

For the Kruskal-Wallis test, the null hypothesis assumes that the
samples come from identical populations, and the test is identical to a
one-way analysis of variance with the data replaced by their ranks. It is
an extension of the non-parametric Mann-Whitney U test to three or more
groups.

In the example below we use the ToothGrowth data to examine if
the dose (0.5, 1, or 2 mg) of vitamin C influences the length of teeth
in guinea pigs. kruskal.test computes the Kruskal-Wallis test and
accepts input either as a set of vectors, where each vector contains data
from one of the k groups that are compared, or as a model formula. We
use the model formula approach below.

> data(ToothGrowth)
> # Use only data on ascorbic acid
> teeth <- subset(ToothGrowth, supp=="VC")
> kruskal.test(len ~ factor(dose), data=teeth)

        Kruskal-Wallis rank sum test

data:  len by factor(dose)
Kruskal-Wallis chi-squared = 25.0722, df = 2, p-value = 3.594e-06

The p-value for Kruskal-Wallis' test is practically zero so for these data 
we reject the null hypothesis that the teeth lengths are identical for all 
three doses. 
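
The Kruskal-Wallis statistic can be reproduced from the ranks alone,
which makes the connection to analysis of variance on ranks explicit.
The sketch below uses the textbook formula with a correction for ties;
it is meant as an illustration rather than as a replacement for
kruskal.test.

```r
data(ToothGrowth)
teeth <- subset(ToothGrowth, supp=="VC")
r <- rank(teeth$len)                 # Joint ranks (midranks for ties)
g <- factor(teeth$dose)
N <- length(r)
Ri <- tapply(r, g, sum)              # Rank sum in each group
ni <- tapply(r, g, length)           # Group sizes

# Uncorrected statistic based on the group rank sums
H <- 12 / (N * (N + 1)) * sum(Ri^2 / ni) - 3 * (N + 1)

# Correct for ties among the observations
ties <- table(teeth$len)
H <- H / (1 - sum(ties^3 - ties) / (N^3 - N))
H
kruskal.test(len ~ factor(dose), data=teeth)$statistic   # Same value
```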

When the obtained value of a Kruskal-Wallis test is significant, it indi¬ 
cates that at least one of the groups is different from at least one of the 
other groups. The kruskalmc function from the pgirmess package 
can be used to make pairwise multiple comparisons after a significant 
Kruskal-Wallis test to determine which of the groups are different. 

> library(pgirmess) 

> kruskalmc(len ~ factor(dose) , data=teeth) 

Multiple comparison test after Kruskal-Wallis 
p.value: 0.05 

Comparisons 

obs.dif critical.dif difference 
0.5-1 10.3 9.425108 TRUE 

0.5-2 19.7 9.425108 TRUE 

1-2 9.4 9.425108 FALSE 

The output from kruskalmc shows that the differences are found
between dose 0.5 and the other two doses and that dose 1 and dose 2 are
not significantly different (but almost since the observed difference is
9.4 and the critical difference is 9.42).

Note that kruskal.test uses an asymptotic chi-squared distribution to
calculate the p-value. Alternatively, the kruskal_test function from the
coin package accepts the same input as kruskal.test but can compute
approximate p-values using Monte Carlo resampling. The argument
distribution=approximate() ensures that kruskal_test computes
approximate test probabilities instead of asymptotic probabilities.
The number of Monte Carlo replications is 1000 by
default but the B option to the approximate function changes the
number of replications. For example, 9999 Monte Carlo replications
are obtained with distribution=approximate(B=9999).

> library(coin) 

> kruskal_test(len ~ factor(dose), data=teeth, 

+ distribution=approximate()) 

Approximative Kruskal-Wallis Test 

data: len by factor(dose) (0.5, 1, 2) 

chi-squared = 25.0722, p-value < 2.2e-16 

Again, we reject the null hypothesis that the teeth lengths are identical 
for all three doses. 
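
The idea behind a Monte Carlo p-value can be sketched in base R by
shuffling the group labels. This is a hand-rolled illustration of the
principle and not the algorithm used internally by the coin package;
the number of permutations and the seed are arbitrary choices.

```r
data(ToothGrowth)
teeth <- subset(ToothGrowth, supp=="VC")
g <- factor(teeth$dose)
obs <- kruskal.test(teeth$len, g)$statistic      # Observed statistic

set.seed(1)
nperm <- 999
# Shuffle the group labels and recompute the statistic each time
perm <- replicate(nperm, kruskal.test(teeth$len, sample(g))$statistic)

# Share of statistics at least as large as the observed one,
# counting the observed statistic itself as one of the permutations
p.mc <- (1 + sum(perm >= obs)) / (nperm + 1)
p.mc
```

With this construction the smallest attainable p-value is 1/(nperm + 1),
so more permutations are needed to resolve very small p-values.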

See also: Rule 3.5 explains how to make one-way analysis of variance.
The oneway.test function can test equality of means without assuming
equal variances in each group. See Rule 3.40 for the two-group
Mann-Whitney test.


3.42 Compare groups using Friedman's test for a two- 
way block design 

Problem: You want to use a non-parametric test to compare the
distribution of observations from k groups when there are measurements
from b randomized blocks for each group.

Solution: Friedman's test is a non-parametric test used to compare the
distribution of observations from k groups from a balanced block
design without replications; i.e., there is exactly one observation for each
combination of group and block. Friedman's test is a non-parametric
alternative for analysis of variance with repeated measurements
(two-way analysis of variance) where the same parameter has been
measured under different conditions (the groups) on the same "subjects"
(the blocks) and where the assumption of normality may be violated.


Friedman's test is implemented in R by the friedman.test function,
and the function tests the null hypothesis that, apart from the effect
of the blocks, the location parameter is the same for each of the
groups. The input to friedman.test can either be three vectors
corresponding to the response values, the grouping factor and the
blocking factor, respectively, or the input can be a model formula of the
form response ~ groups | blocks. We use the model formula approach
below.

The cabbage data from the isdals package lists the yield of cabbage
from four different treatments each grown on four different fields (the
blocks). We are only interested in comparing the treatments but want
to account for the fact that some of the observations are collected from
the same fields.

> library(isdals) 

> data(cabbage) 

> friedman.test(yield ~ method | field, data=cabbage) 

Friedman rank sum test 
data: yield and method and field 

Friedman chi-squared = 11.1, df = 3, p-value = 0.01120 

Friedman's rank sum test rejects the null hypothesis of equal location 
parameter and we conclude that there are significant differences in the 
yields among the four treatment methods (p-value of 0.01120). If we 
look at the individual ranks (or the original values), then we see that 
treatment method "K" has substantially lower yield than the other three 
treatment methods. 
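
Friedman's statistic is computed from ranks taken within each block.
The sketch below reproduces it by hand for a small hypothetical data
matrix (4 blocks in rows, 3 groups in columns, no ties within blocks);
the cabbage data are not used here so that the example runs without
additional packages.

```r
# Hypothetical data: 4 blocks (rows) by 3 groups (columns), no ties
y <- matrix(c(1.2, 2.3, 3.1,
              0.9, 2.8, 2.5,
              1.5, 1.1, 3.3,
              2.0, 2.4, 3.9), nrow=4, byrow=TRUE)
b <- nrow(y)                     # Number of blocks
k <- ncol(y)                     # Number of groups

r <- t(apply(y, 1, rank))        # Rank the observations within each block
Rj <- colSums(r)                 # Rank sum for each group

# Friedman's statistic in the absence of ties
Q <- 12 / (b * k * (k + 1)) * sum(Rj^2) - 3 * b * (k + 1)
Q
friedman.test(y)$statistic       # Same value
```

Ranking within blocks removes the block effect, which is why only the
group rank sums enter the statistic.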

friedman.test uses an asymptotic chi-squared distribution to calculate
the p-value. The coin package provides the function friedman_test
which can compute approximate p-values using Monte Carlo resampling.
Setting the distribution=approximate() argument to the
friedman_test function ensures that approximate test probabilities
are computed instead of asymptotic probabilities. The number of Monte
Carlo replications is 1000 by default, but it can be changed by setting
the B argument to the approximate function; e.g., to use 9999 Monte
Carlo replications we set distribution=approximate(B=9999).
friedman_test uses the same input as friedman.test except that
the "block variable" must be converted to a factor manually if it is not
already a factor.

> library(coin) 

> friedman_test(yield ~ method | factor(field), 

+ data=cabbage, distribution=approximate()) 

Approximative Friedman Test

data:  yield by
         method (A, C, K, N)
         stratified by factor(field)
chi-squared = 11.1, p-value = 0.001

We reach the same conclusion from the Monte Carlo approximation of 
the p-value as we did from the asymptotic p-value. 

See also: Use Cochran's test for this type of situation but with binary
responses. Cochran's test can be calculated using the function
cochran.test from the outlier package.


Survival analysis 


3.43 Fit a Kaplan-Meier survival curve to event history 
data 

Problem: You want to fit a Kaplan-Meier survival curve to event history
data.

Solution: Data that measure lifetime or the length of time until the
occurrence of an event are called failure time, event history, or survival
time data. For example, the response variable could be the lifetime of
light bulbs or the survival time of heart transplant patients. A special
consideration with survival time data is that right-censoring is often
present; i.e., we may not know the time of event for all individuals,
but we may know that the event had not occurred at the given
point in time. Thus each response observation consists of two variables:
the observed time and a censoring indicator, which tells if the observed
time point is due to an event or if it has been censored and we only
know that the event had not happened yet. The purpose of Kaplan-Meier
survival analysis is to estimate and compare the underlying distribution
of the failure time variable among k groups of individuals.
The survfit function from the survival package computes Kaplan-Meier
survival estimates and can be used to create Kaplan-Meier survival
curves for right-censored data. The input to survfit should be
a model formula where the response is a survival object, and where the
right-hand side of the formula is a factor that defines the groups. The
conf.int option has the default value of 0.95 and can be modified to
change the confidence limits if desired.

Before survfit can be used, we need to create the survival object that
serves as response variable by using the Surv function. For traditional
right-censored data, Surv should be called with two arguments: the first
argument, time, is a vector of survival times and the second argument,
event, is a vector that indicates if the corresponding time measurement
was a proper observation or if it was censored. The status indicator
should be coded as either 0/1 ("1" indicates a real time observation and
"0" a censored time), TRUE/FALSE (FALSE means the time is censored),
or 1/2 (where "1" means censored).
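The three codings of the status indicator can be compared with a short sketch. It assumes the survival package is installed, and the three observed times are made up for illustration. The object names s01, sTF and s12 are hypothetical.

```r
library(survival)  # recommended package, normally installed together with R

time <- c(59, 115, 421)                       # hypothetical observed times

s01 <- Surv(time, c(1, 1, 0))                 # 0/1 coding: 0 = censored
sTF <- Surv(time, c(TRUE, TRUE, FALSE))       # logical coding: FALSE = censored
s12 <- Surv(time, c(2, 2, 1))                 # 1/2 coding: 1 = censored

# Extract the internal status column (1 = event, 0 = censored) of each object
status01 <- unclass(s01)[, "status"]
statusTF <- unclass(sTF)[, "status"]
status12 <- unclass(s12)[, "status"]

print(s01)   # censored times are printed with a trailing "+"
```

All three objects should carry the same internal status column, so the choice of coding is a matter of convenience as long as it is used consistently.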

Using plot on the fitted model produces a plot of Kaplan-Meier survival
curves. The mark.time option to plot is set to TRUE by default,
in which case censoring times that are not event times are marked on
the survival curves. Setting mark.time to FALSE turns off the marks,
and setting it to a numeric vector places a mark at every specified time
point.

The ovarian dataset from the survival package contains observations
on survival time for two different treatments for ovarian cancer.
The variables futime, fustat, and rx represent the observed time,
censoring status and treatment group, respectively. In the following
example, we compare the survival distributions for the two treatment
groups.


> library(survival)
> data(ovarian)
> head(ovarian)
  futime fustat     age resid.ds rx ecog.ps
1     59      1 72.3315        2  1       1
2    115      1 74.4932        2  1       1
3    156      1 66.4658        2  1       2
4    421      0 53.3644        2  2       1
5    431      1 50.3397        2  1       1
6    448      0 56.4301        1  1       2

> # Fit Kaplan-Meier curves for each treatment
> model <- survfit(Surv(futime, fustat) ~ rx, data=ovarian)
> summary(model)
Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)

                rx=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   59     13       1    0.923  0.0739        0.789        1.000
  115     12       1    0.846  0.1001        0.671        1.000
  156     11       1    0.769  0.1169        0.571        1.000
  268     10       1    0.692  0.1280        0.482        0.995
  329      9       1    0.615  0.1349        0.400        0.946
  431      8       1    0.538  0.1383        0.326        0.891
  638      5       1    0.431  0.1467        0.221        0.840

                rx=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  353     13       1    0.923  0.0739        0.789        1.000
  365     12       1    0.846  0.1001        0.671        1.000
  464      9       1    0.752  0.1256        0.542        1.000
  475      8       1    0.658  0.1407        0.433        1.000
  563      7       1    0.564  0.1488        0.336        0.946

> plot(model, lty=2:3, xlab="Days", ylab="Survival")   # Plot them

Figure 3.18: Kaplan-Meier survival curves for treatment group 1 (dashed line)
and group 2 (dotted line) for the ovarian data. Censoring times are marked on
the curves.

Figure 3.18 shows the Kaplan-Meier survival curves for the two treatment
groups. The summary output from the fitted Kaplan-Meier curves
shows the number of observations at risk and the number of events
for each group as well as the estimated survival probabilities and their
pointwise confidence limits. A Kaplan-Meier survival curve for a single
group can be fitted by using ~ 1 as the right-hand side of the model formula,
and in that situation a 95% confidence interval for the curve will
be shown when the survival curve is plotted.
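To make the survival column of the summary output less of a black box, here is a minimal base R sketch of the Kaplan-Meier product-limit calculation. The vectors time and status are hypothetical toy data, not the ovarian data, and the code is only an illustration of the recipe, not the implementation used by survfit.

```r
# Hypothetical toy data: observed times and status (1 = event, 0 = censored)
time   <- c(5, 8, 8, 12, 15, 20)
status <- c(1, 1, 0, 1, 0, 1)

# Distinct event times, number at risk just before each, and number of events
event_times <- sort(unique(time[status == 1]))
n_risk  <- sapply(event_times, function(tt) sum(time >= tt))
n_event <- sapply(event_times, function(tt) sum(time == tt & status == 1))

# Product-limit estimate: multiply the conditional survival fractions
km <- data.frame(time = event_times, n.risk = n_risk, n.event = n_event,
                 survival = cumprod(1 - n_event / n_risk))
print(km)

# The first two survival values in the rx=1 table follow the same recipe:
# 13 at risk with 1 event, then 12 at risk with 1 event
first_rows <- cumprod(c(1 - 1/13, 1 - 1/12))
round(first_rows, 3)
```

Censored observations never lower the curve directly; they only reduce the number at risk at later event times, which is why n.risk can drop by more than one between rows of the summary output.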

The log-rank test tests if there is any difference in survival distribution
among the different groups. The survdiff function computes the log-rank
test statistic and the corresponding p-value for the test
of no difference among groups. survdiff takes the same input as
survfit. The rho option is a scalar parameter that determines the
type of test computed by survdiff. The default value of 0 produces
the log-rank test whereas rho=1 changes the test so it is equivalent to
the Peto and Peto modification of the Gehan-Wilcoxon test.

> survdiff(Surv(futime, fustat) ~ rx, data=ovarian)
Call:
survdiff(formula = Surv(futime, fustat) ~ rx, data = ovarian)

      N Observed Expected (O-E)^2/E (O-E)^2/V
rx=1 13        7     5.23     0.596      1.06
rx=2 13        5     6.77     0.461      1.06

 Chisq= 1.1  on 1 degrees of freedom, p= 0.303

We get a chi-square test statistic of 1.1 and a corresponding p-value of
0.303 and conclude from the log-rank test that there is no significant
difference in survival times between the two treatment groups.
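The reported p-value can be checked by hand from the printed statistic. The sketch below uses the rounded value 1.06 from the (O-E)^2/V column, so the result only agrees with the printed p-value to about three decimals. The variable names are chosen for illustration.

```r
chisq <- 1.06                        # rounded (O-E)^2/V from the survdiff output

# Upper tail of a chi-squared distribution with 1 degree of freedom
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

# Equivalent two-sided test based on the standard normal statistic
z <- sqrt(chisq)
p_from_z <- 2 * pnorm(-z)

round(c(p_value = p_value, z = z), 4)
```

With one degree of freedom the chi-squared statistic is simply the square of a standard normal statistic, which is why a log-rank test for two groups can be reported either way.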

When the sample size is rather limited, it might be dangerous to rely
on asymptotic tests. Instead, the surv_test function from the coin
package can be used to compute exact log-rank tests when argument
distribution=exact is given. In the following code we need to
specify that the grouping variable, rx, should be considered a factor
since surv_test does not automatically convert the variable to a factor.

> library(coin) 

> surv_test(Surv(futime, fustat) ~ factor(rx), 

+ data=ovarian, distribution=exact) 

Exact Logrank Test 

data:  Surv(futime, fustat) by factor(rx) (1, 2)

Z = 1.0296, p-value = 0.2974 
alternative hypothesis: two.sided 

The p-value from the exact log-rank test is almost identical to the
asymptotic result obtained by survdiff and we again conclude that there is
no significant difference in survival time between the two treatments.

See also: The km.ci function from the km.ci package computes pointwise
and simultaneous confidence intervals for the Kaplan-Meier
estimator.


3.44 Fit a Cox regression model (proportional hazards 
model) 

Problem: You want to fit a Cox regression model (proportional hazards 
model) to event history data. 

Solution: Data that measure lifetime or the length of time until the
occurrence of an event are called failure time, event history, or survival
time data. For example, the response variable could be the lifetime of
light bulbs or the survival time of heart transplant patients. A special
consideration with survival time data is that right-censoring is often
present; i.e., we may not know the time of event for all individuals,
but we may know that the event had not occurred at the given
time point. Thus each response observation consists of two variables:
the observed time and a censoring indicator which tells if the observed
time point is due to an event, or if it has been censored and we only
know that the event had not happened yet. The purpose of survival
analysis is to model the underlying distribution of the failure times and
to assess the dependence of the failure time variable on the explanatory
variables.

Cox's proportional hazards model is widely used for survival analysis
because it is not based on any assumptions concerning the shape of the
underlying survival distribution. The proportional hazards model assumes
that the logarithm of the hazard rate is a linear function of the
explanatory variables, but no assumptions are made about the shape of
the baseline hazard function.
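In standard textbook notation the model can be written as follows. The symbols h_0 (baseline hazard), beta (regression coefficients) and x (explanatory variables) are introduced here for illustration only and are not used elsewhere in this rule.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \cdots + \beta_p x_p)
\quad\Longleftrightarrow\quad
\log\frac{h(t \mid x)}{h_0(t)} = \beta_1 x_1 + \cdots + \beta_p x_p
```

The baseline hazard h_0(t) is left unspecified, and the ratio of the hazards for two individuals does not depend on t, which is the proportionality that gives the model its name.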

Cox's proportional hazards model can be fitted with the coxph function
from the survival package. The first argument to coxph is a
model formula, where the explanatory variables are specified on the
right-hand side of the formula and where the response is a survival object
created by the Surv function just as for Kaplan-Meier estimation
(see Rule 3.43). The method option determines how ties are handled.
The default method is efron and other alternatives are breslow and
exact.

The ovarian dataset from the survival package contains observations
on survival time for two different treatments of ovarian cancer.
The variables futime, fustat, and rx, represent the observed time, 
censoring status and treatment group, respectively. In the following 
example, we start with a model where age is allowed to influence the 
hazard rate differently for the two treatment groups. 


> library(survival)
> data(ovarian)
> ovarian$rx <- factor(ovarian$rx)   # Make sure rx is a factor
> head(ovarian)
  futime fustat     age resid.ds rx ecog.ps
1     59      1 72.3315        2  1       1
2    115      1 74.4932        2  1       1
3    156      1 66.4658        2  1       2
4    421      0 53.3644        2  2       1
5    431      1 50.3397        2  1       1
6    448      0 56.4301        1  1       2
> model <- coxph(Surv(futime, fustat) ~ rx*age, data=ovarian)
> drop1(model, test="Chisq")
Single term deletions

Model:
Surv(futime, fustat) ~ rx * age
       Df    AIC    LRT Pr(Chi)
<none>    59.019
rx:age  1 58.084 1.0648  0.3021
> model2 <- coxph(Surv(futime, fustat) ~ rx + age, data=ovarian)
> summary(model2)
Call:
coxph(formula = Surv(futime, fustat) ~ rx + age, data = ovarian)

  n= 26, number of events= 12

        coef exp(coef) se(coef)      z Pr(>|z|)
rx2 -0.80397   0.44755  0.63205 -1.272  0.20337
age  0.14733   1.15873  0.04615  3.193  0.00141 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    exp(coef) exp(-coef) lower .95 upper .95
rx2    0.4475      2.234    0.1297     1.545
age    1.1587      0.863    1.0585     1.268

Rsquare= 0.457   (max possible= 0.932 )
Likelihood ratio test= 15.89  on 2 df,   p=0.0003551
Wald test            = 13.47  on 2 df,   p=0.00119
Score (logrank) test = 18.56  on 2 df,   p=9.34e-05

The drop1 (with option test="Chisq" to get p-values) and summary
functions produce likelihood ratio tests for the explanatory variables
and parameter estimates, respectively. We find no significant interaction
between treatment group and age (p = 0.3021), so a model without
interaction is specified. The Wald tests in the final model show that
age is significant (p = 0.00141) but treatment is not (p = 0.20337), so
treatment could be removed from the model. The exponentiated coefficients
found in the second column for each estimate are the estimated
hazard ratios; i.e., the multiplicative effects on the hazard. Thus for age,
we have an exponentiated estimate of 1.159 so each additional year of
age increases the hazard of death by about 15.9%. summary also prints
the overall likelihood ratio, Wald and score tests for the hypothesis that
all parameters are zero.
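The hazard ratio, its confidence limits and the Wald p-value for age can be recomputed from the printed coefficient and standard error. This is only a sketch of the arithmetic behind the summary table; the variable names are chosen for illustration, and small differences from the printed values are due to rounding of the inputs.

```r
coef_age <- 0.14733      # coef for age from the summary output
se_age   <- 0.04615      # se(coef) for age

hr <- exp(coef_age)                                      # hazard ratio per year
ci <- exp(coef_age + c(-1, 1) * qnorm(0.975) * se_age)   # 95% Wald interval
z  <- coef_age / se_age                                  # Wald statistic
p  <- 2 * pnorm(-abs(z))                                 # two-sided p-value

round(c(hr = hr, lower = ci[1], upper = ci[2], z = z, p = p), 4)

# Percentage change in the hazard per additional year of age
100 * (hr - 1)
```

The confidence interval is computed on the coefficient scale and then exponentiated, which is why it is not symmetric around the hazard ratio.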

The proportionality assumption of the Cox regression model can be
checked using the cox.zph function from survival. cox.zph tests
the proportional-hazards assumption for each covariate by comparing
the (Schoenfeld) residuals to a transformation of time. The default
transformation of the survival times is based on the Kaplan-Meier estimate,
but other possibilities are rank and identity which can be set using
the transform argument.

> validate <- cox.zph(model) 

> validate 

rho chisq p 
rx2 0.1105 0.1366 0.712 

age -0.1342 0.2164 0.642 

rx2:age -0.0888 0.0879 0.767 
GLOBAL NA 1.5936 0.661 

The proportionality assumption does not appear to be violated for any 
of the covariates since all p-values are large. The global test is an overall 
test of proportionality for all variables while the first three lines list tests 
for the individual explanatory variables. 

The Cox model can be stratified according to a categorical explanatory
variable by using the strata function in the model formula. Stratifying
by a covariate enables us to adjust for the variable without estimating
its effect on the outcome, and without requiring that the covariate
satisfy the proportional hazards assumption since there is a separate
underlying baseline hazard for each level of the categorical variable.
We can use the following model if we wish to stratify the ovarian
dataset according to women over the age of 60 and women under the
age of 60.

> model4 <- coxph(Surv(futime, fustat) ~ rx + strata(age>60), 

+ data=ovarian) 

> summary(model4) 

Call: 

coxph(formula = Surv(futime, fustat) ~ rx + strata(age > 60), 
data = ovarian) 

n= 26, number of events= 12 

       coef exp(coef) se(coef)      z Pr(>|z|)
rx2 -0.4300    0.6505   0.6003 -0.716    0.474

    exp(coef) exp(-coef) lower .95 upper .95
rx2    0.6505      1.537    0.2006      2.11

Rsquare= 0.02   (max possible= 0.846 )
Likelihood ratio test= 0.52  on 1 df,   p=0.4707
Wald test            = 0.51  on 1 df,   p=0.4738
Score (logrank) test = 0.52  on 1 df,   p=0.4709

If the results from the stratified analysis are compared to the previous
analysis we find that the treatment estimate has changed slightly, but
that it is still far from significant. Age is included differently in the
model than before since we now allow the baseline hazard to be different
for different age groups.

See also: Rule 3.45 shows an example of a proportional hazards model
with time-varying covariates. The survfit function can be used on
the object returned from coxph to estimate and plot the baseline survival
curves. However, the survfit function estimates by default at the
mean values of the covariates which may not be meaningful. See the
help page for survfit.coxph (in particular the newdata option)
for information on how to use sensible values for the covariates.

The cph function in package rms extends coxph to allow for interval
time-dependent covariates, time-dependent strata, and repeated events.
The cumres function in package gof computes goodness-of-fit statistics
for the Cox proportional hazards model (see Rule 3.23).


3.45 Fit a Cox regression model (proportional hazards 
model) with time-varying covariates 

Problem: You want to fit a Cox regression model (proportional hazards 
model) with time-varying covariates to event history data. 

Solution: Cox regression models with time-dependent covariates are
an extension of the Cox regression model where some of the explanatory
variables are allowed to change values over time. This could be
the case for, say, humans where for example exercise, food intake, or
weight measurements change over time.

The coxph function from the survival package was used to fit Cox
regression models with time-constant covariates in Rule 3.44, but coxph
can also accommodate time-varying covariates. Time-varying covariates
are handled when creating the survival object using the Surv function.
The idea is to split up the observation for an individual into many
smaller left-truncated time intervals, where the covariates are constant.
If we supply the start time as well as the end time of each interval
together with the event status indicator, then each individual will be part
of the at-risk population with the given explanatory covariates at each
interval. For example, if we have an individual who dies at time 600
days after operation, who has had treatment 1 for the first 100 days,
treatment 2 for the next 300 days and then treatment 1 again, then that
person should enter the data frame with three lines as shown in the
following example data:

start  end  sex  treatment  dead
    0  100    M          1     0
  100  400    M          2     0
  400  600    M          1     1

Hence we should make sure that each individual is split up according
to covariate-constant intervals with one interval per row in the data
frame and make sure that the event status vector is coded correctly.
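The layout can be sketched in base R for the single individual described above (death at day 600; treatment 1, then 2, then 1). The data frame name patient and the two consistency checks are hypothetical helpers added for illustration.

```r
# One row per interval in which the covariates are constant
patient <- data.frame(
  start     = c(0, 100, 400),
  end       = c(100, 400, 600),
  sex       = "M",
  treatment = c(1, 2, 1),
  dead      = c(0, 0, 1)      # the event is recorded on the last interval only
)
print(patient)

# Simple sanity checks for this counting-process layout:
# each interval starts where the previous one ended ...
contiguous <- all(patient$start[-1] == patient$end[-nrow(patient)])
# ... and the event indicator is 1 only on the final interval
one_event <- sum(patient$dead) == 1 && patient$dead[nrow(patient)] == 1

# With the survival package loaded, the response would then be specified as
# Surv(start, end, dead) on the left-hand side of the coxph formula.
```

Checks like these are cheap to run for every individual before fitting, and they catch the most common coding mistake, which is marking the event on every row of a person who eventually dies.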

We illustrate the use of time-varying explanatory covariates using the
jasa1 data from the survival package. The jasa1 data frame contains
information on survival of patients on the waiting list for the
Stanford heart transplant program. People are followed before and after
receiving the transplant so the transplant explanatory variable can
change over time. The jasa1 data frame has already split up individuals
in time-constant intervals as can be seen below.


> library(survival)
> data(heart)
> head(jasa1)
    id start stop event transplant        age      year surgery
1    1     0   49     1          0 -17.155373 0.1232033       0
2    2     0    5     1          0   3.835729 0.2546201       0
102  3     0   15     1          1   6.297057 0.2655715       0
3    4     0   35     0          0  -7.737166 0.4900753       0
103  4    35   38     1          1  -7.737166 0.4900753       0
4    5     0   17     1          0 -27.214237 0.6078029       0
> model <- coxph(Surv(start, stop, event) ~ transplant +
+                year + year*transplant, data=jasa1)
> summary(model)
Call:
coxph(formula = Surv(start, stop, event) ~ year * transplant,
    data = jasa1)

  n= 170, number of events= 75

                   coef exp(coef) se(coef)      z Pr(>|z|)
year            -0.2657    0.7667   0.1052 -2.526   0.0115 *
transplant      -0.2868    0.7507   0.5140 -0.558   0.5769
year:transplant  0.1374    1.1473   0.1409  0.975   0.3295
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                exp(coef) exp(-coef) lower .95 upper .95
year               0.7667     1.3043    0.6239    0.9422
transplant         0.7507     1.3321    0.2741    2.0556
year:transplant    1.1473     0.8716    0.8704    1.5123

Rsquare= 0.05   (max possible= 0.97 )
Likelihood ratio test= 8.64  on 3 df,   p=0.03441
Wald test            = 8.16  on 3 df,   p=0.04279
Score (logrank) test = 8.46  on 3 df,   p=0.03743

Figure 3.19: Survival curves from a Cox regression model with time-varying
covariates. Solid lines are before transplantation, and dashed lines are after
transplantation.

Based on this model (without any further model reductions), we see
that the hazard is decreasing as the years increase for both the
no-transplant and the transplant group (estimates of -0.2657 and
-0.2657 + 0.1374 = -0.1283 per increase in year, respectively). There seems
to be a positive effect of transplant since the corresponding estimate is
negative and hence decreases the hazard. However, that effect is not
statistically significant.
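The two year effects quoted above come from adding the interaction coefficient to the main effect. A minimal sketch of that arithmetic, using the printed (rounded) coefficients and illustrative variable names:

```r
b_year  <- -0.2657     # coef for year (no-transplant group)
b_inter <-  0.1374     # coef for year:transplant

slope_no_tx <- b_year              # log hazard change per year, no transplant
slope_tx    <- b_year + b_inter    # log hazard change per year, after transplant

# Corresponding hazard ratios per one-unit increase in year
hr <- exp(c(no_transplant = slope_no_tx, transplant = slope_tx))
round(hr, 4)
```

Both hazard ratios are below 1, which is the numerical counterpart of the statement that the hazard decreases with year in both groups.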

Just as for time-constant covariates, cox.zph can be used to test the
proportional hazards assumption and the strata function stratifies
the model. The survfit function can be used to generate survival
curves from the Cox model with time-varying covariates.

> cox.zph(model) 

rho chisq p 

year 0.0117 0.0111 0.916 

transplant -0.0490 0.2001 0.655 

year:transplant 0.0370 0.1102 0.740 

GLOBAL NA 0.4484 0.930 

> model2 <- coxph(Surv(start, stop, event) ~ transplant:year +
+                 year + strata(transplant), data=jasa1)
> summary(model2)
Call:
coxph(formula = Surv(start, stop, event) ~ strata(transplant) +
    transplant:year + year, data = jasa1)

  n= 170, number of events= 75

                   coef exp(coef) se(coef)      z Pr(>|z|)
year            -0.2766    0.7584   0.1057 -2.618  0.00885 **
year:transplant  0.1510    1.1629   0.1427  1.058  0.29012
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                exp(coef) exp(-coef) lower .95 upper .95
year               0.7584     1.3186    0.6165    0.9329
year:transplant    1.1629     0.8599    0.8792    1.5382

Rsquare= 0.051   (max possible= 0.954 )
Likelihood ratio test= 8.94  on 2 df,   p=0.01147
Wald test            = 8.57  on 2 df,   p=0.01378
Score (logrank) test = 8.97  on 2 df,   p=0.01126
> plot(survfit(model2), conf.int=TRUE, lty=1:2,
+      xlab="Time (days)", ylab="Survival")

The proportional hazards assumption appears to be valid, and the survival
curves are shown in Figure 3.19. The wide confidence intervals
shown on the survival curves match the non-significant effect of
transplantation even though the two survival curves are non-overlapping.

See also: See the timecox function in the timereg package for Cox
regression models with time-varying parameters.

Chapter 4

Graphics 


R is rich in graphical functions and it is very easy to produce high
quality graphics. Base R plotting uses an ink-on-paper approach where
high-level graphics functions (e.g., plot, hist, and boxplot) initialize
a plot or create an entire plot while low-level graphics functions
(e.g., lines, title, and legend) add to the current plot. The
ink-on-paper approach means that once something is added to the plot it
cannot be removed, and any desired customizations to the plot must be
included in the call(s) to the graphics function(s) that produce the plot.
Base R graphics has been extended by several packages which
extend or redefine graphics, most notably the lattice package which
produces trellis graphics, and ggplot2 which produces graphics using
graphical grammar. Both packages are highly useful and well worth
learning, but they are beyond the scope of this book. Interested readers
are referred to the excellent books by Sarkar (2008) and Wickham
(2009).

Most of the high-level plotting functions can directly customize many 
of the graphical parameters through common arguments. If graphical 
parameters are set as arguments to a plotting function, then the changes 
are only in effect for the given call to that function. Alternatively,
graphical parameters can be set through the par function in which case they
will be in effect for the rest of the R session or until they are changed 
again. 
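The difference between the two ways of setting graphical parameters can be seen in a small sketch. It draws on a null PDF device (pdf(NULL)) so that no file or window is produced; the variable names are chosen for illustration.

```r
pdf(NULL)                                     # null device: nothing is written

old <- par(lwd = 3)                           # par() returns the previous values
plot(1:5, (1:5)^2, type = "l", col = "red")   # col applies to this call only

lwd_during <- par("lwd")                      # 3, because it was set through par
col_during <- par("col")                      # not changed by the plot call

par(old)                                      # restore the previous line width
lwd_after <- par("lwd")

invisible(dev.off())
c(lwd_during = lwd_during, lwd_after = lwd_after)
```

Saving the value returned by par and passing it back afterwards is a convenient way to make sure that a temporary change does not leak into later plots.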

The help page for par shows a detailed list of all graphical parameters. 
Some of the most commonly used graphics parameters accepted by the 
majority of plotting functions are: 

• xlab and ylab which set the character strings for the x and y axis
labels, respectively.

• col and bg which change the plotting color and background plotting
color (see Rule 4.2).

• pch which changes the plotting symbol for points.

• lty and lwd which change the line type and line width when
plotting lines. Default line width is 1.

• main which is a character string for the plot title.

• xlim and ylim which set the range of the x and y axes, respectively,
and each accepts a pair of numbers for the lower and upper limit
of the axis.

• las which sets the style of the axis tick mark labels and should
be an integer from 0-3 corresponding to parallel to the axis (the
default), always horizontal, always perpendicular to the axis, and
always vertical, respectively.

• log which is used to request logarithmic axes. It can take the
character strings "x", "y", and "xy" which refer to logarithmic
x axis, logarithmic y axis, and double-logarithmic axes, respectively.

Figure 4.1: Examples of plotting symbols and line types in R plots.

The plotting symbol, pch, and line types, lty, take integer values as
shown in Figure 4.1, but the plotting character can also be specified
directly; e.g., pch="a" which uses the character a as plot symbol. Note
that for plotting symbols 21 through 25 it is possible to set different
border colors (using the col option) and fill colors (using the bg option).
R will recycle graphical values so it is easy to change, for example,
plotting symbols and colors for different points.
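Recycling can be previewed without drawing anything. The sketch below uses rep_len to mimic what happens when two plotting symbols and three colors are supplied for five points; that the plotting functions recycle in exactly this left-to-right fashion is stated here as an assumption, though it matches the patterns noted in the code comments of this chapter.

```r
n <- 5                                              # number of points to plot

pch_used <- rep_len(c("a", "b"), n)                 # a b a b a
col_used <- rep_len(c("red", "blue", "green"), n)   # red blue green red blue

data.frame(point = 1:n, pch = pch_used, col = col_used)
```

If the pattern should follow a grouping variable instead of the row order, it is usually safer to index the vector of symbols or colors by the group, e.g. col=c("red", "blue")[group] for a factor or integer group.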

Text and symbol size are controlled through the cex family of arguments.
cex sets the amount of scaling that is used when plotting symbols
and text relative to the default. cex defaults to 1, and a value of
1.5 means that symbols and text should be 50% larger, a value of 0.5
is 50% smaller, etc. The options cex.axis, cex.lab, cex.main, and
cex.sub are used to set the magnification for axis annotations, labels,
main title, and sub-title, respectively, relative to the cex value.

The following code illustrates how graphic parameters are customized 
and changed. The output is not shown. 


> x <- 1:5                        # Generate data
> y <- x^2                        # Generate data
> plot(x, y, pch=3:7)             # Use symbols 3-7 for the points
> plot(x, y, pch=c("a", "b"))     # Use symbols a b a b a
> plot(x, y, col="red")           # Each symbol is red
> plot(x, y, pch=c("a", "b"),     # Symbols are a b a b a in colors
+      col=c("red", "blue", "green"))   # red blue green red blue
> # Plot circles with a thick red border and blue fill color
> plot(x, y, pch=21, lwd=3, col="red", bg="blue")
> plot(x, y, cex=seq(1,2,.25))    # Symbols with increasing size
> plot(x, y, type="l", lty=2)     # Line with line type 2
> plot(x, y, type="l", lwd=3)     # Line with line width 3
> par(lwd=3, pch=16)              # Line width 3 and symbol 16
> par(cex.lab=2, cex.main=3)      # Large labels and larger title
> plot(x, y, xlim=c(0, 10))       # x axis goes from 0 to 10
> plot(x, y, log="y")             # Logarithmic y axis
> plot(x, y, las=1)               # Rotate numbers on y axis


The text function adds a text string at the coordinate given by the x
and y arguments. The argument labels should be a character vector
or expression that contains the text to be written. Text can be written to
one of the four margins of a plot using the mtext function. The first
argument to mtext should be a character vector or expression containing
the text to write. The side argument determines on which margin the
text is printed: 1 is bottom, 2 is left, 3 is top (the default), and 4 is right.
Both text and mtext have additional arguments that change the text
adjustment and give more control over the exact text placement. Note
that when adding text to the margins of a plot it may be necessary to
force R to include more space in the margin. This is done with the mar
option to the par function (see Rule 4.21 for an example).
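A minimal sketch of widening the right margin before writing text into it is shown below. The margin sizes and text strings are arbitrary choices for illustration, and a null PDF device is used so that nothing is drawn on screen.

```r
pdf(NULL)                                   # null device: nothing is written

# Default margins are c(5, 4, 4, 2) + 0.1 lines (bottom, left, top, right);
# here the right margin (side 4) is widened from 2.1 to 4.1 lines
old <- par(mar = c(5, 4, 4, 4) + 0.1)

plot(1:5, (1:5)^2, xlab = "x", ylab = "y")
text(2, 20, "Inside the plot region")             # placed at coordinate (2, 20)
mtext("Right margin text", side = 4, line = 2)    # placed in the right margin

mar_used <- par("mar")
par(old)                                    # restore the previous margins
mar_restored <- par("mar")
invisible(dev.off())
```

The mar setting has to be changed before the high-level plot call, because the plot region is laid out when the plot is initialized.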

> text(1, 2, "Where's Wally?") # Add text to existing plot

> mtext("I'm out here", side=4) # and text to the right margin 


4.1 Including Greek letters and equations in graphs 

Problem: You want to add Greek letters or equations in titles, labels or 
text on a plot. 

Solution: Greek letters, super- or subscript, and mathematical equations
are easy to include in titles, labels or text-drawing functions. If
the text argument to one of the functions that draw text is an expression,
then the argument is formatted according to TeX-like rules.

Table 4.1: Examples of formats possible with text-drawing functions

R code            Interpretation        Example
x[i]              subscript             xᵢ
x^2               superscript           x²
x%+-%y            plus / minus          x±y
x%*%y             times                 x×y
x%.%y             center dot            x·y
x!=y              not equal             x≠y
x<=y              less than or equal    x≤y
alpha .. omega    Greek letters         α .. ω
hat(x)            hat                   x̂
bar(x)            bar                   x̄
x%->%y            right arrow           x→y
sqrt(x)           square root           √x

Table 4.1 lists examples of the formatting that can be used in text-drawing
functions. Run demo(plotmath) to see a thorough demonstration
of the possibilities with expressions in R text.

The paste function can be used inside the expression function to
paste regular text and mathematical expressions together. The following
lines of code show some of the possibilities, and the resulting output
can be seen in Figure 4.2.

> x <- seq(0, 5, .1)
> # Plot sine and square root curve and add x axis label
> plot(x, sin(x), type="l", ylim=c(-1, 2.5),
+      xlab=expression(paste("Concentration ", mu[i])))
> lines(x, sqrt(x), lty=2)
> title(expression(paste("This looks like ", Gamma,
+       rho, epsilon, epsilon, kappa, " to me")))
> # Place equations at specific positions
> text(4, 1.5, expression(hat(x) == sqrt(x)))
> text(1.6, -.5, expression(paste(plain(sin)(x) ==
+      sum(frac((-1)^n, paste((2*n+1), plain("!")))*x^(2*n+1),
+      n==0, infinity))))

Note that expressions do not work for axis labels on plots created by 
the persp function. 

See also: The help page for plotmath and demo(plotmath) give a
complete list of all possible mathematical annotations in R.

Figure 4.2: Examples of Greek letters and mathematical expressions in R plots.


4.2 Set colors in R graphics 

Problem: You want to use a special color in your R graphics. 

Solution: Colors can be specified in R graphics using either a character
string that defines the color name or a hexadecimal character string that
gives the red, green, and blue intensities. If colors are given as
hexadecimal strings, then they should have the form #rrggbb, where rr, gg,
bb are hexadecimal numbers giving the red, green, and blue intensities,
respectively.

The colors function lists all the predefined colors found in R. The
graphical output from the following function calls is not shown.


> colors()
  [1] "white"         "aliceblue"     "antiquewhite"
  [4] "antiquewhite1" "antiquewhite2" "antiquewhite3"
  [7] "antiquewhite4" "aquamarine"    "aquamarine1"

[649] "wheat3"        "wheat4"        "whitesmoke"
[652] "yellow"        "yellow1"       "yellow2"
[655] "yellow3"       "yellow4"       "yellowgreen"


> plot(1:10, col="cyan") # Plot using cyan color 

> plot(1:10, col="#cc0076") # Plot using purplish color 
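To see which intensities a color name or hexadecimal string stands for, the col2rgb function from grDevices can be used. The sketch below decodes the two colors used in the plot calls above; the variable names are chosen for illustration.

```r
# col2rgb returns a matrix with rows red, green, blue (values 0-255)
purplish <- col2rgb("#cc0076")
cyan     <- col2rgb("cyan")

print(purplish)
print(cyan)

# The same conversion by hand for one pair of hexadecimal digits
hex_cc <- strtoi("cc", base = 16L)
```

Decoding a color this way is handy when a hexadecimal string found in someone else's code needs to be matched to, or compared with, one of the named colors.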

Color specification can also be made with the rgb or hsv functions.
The rgb function returns a hexadecimal character string and takes three
numerical arguments red, green, and blue, that give the intensities
of the three colors. The maxColorValue argument defaults to 1, but
can be set to, say, 255, to let the individual color intensities represent
8-bit numbers.

rgb can also generate transparent colors with the alpha argument.
alpha takes a number between zero and maxColorValue and gives
the degree of transparency. Zero means fully transparent while the
maximum value means opaque. Transparent colors are supported only
on some devices.

> rgb(10, 20, 30, maxColorValue=255) 

[1] "#0A141E"

> rgb(10, 20, 30, maxColorValue=100) 

[1] "#1A334D" 

> rgb(10, 20, 30, alpha=50, maxColorValue=100) 

[1] "#1A334D80" 
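The hexadecimal strings above can be reproduced by hand: each intensity is rescaled to the range 0-255 and written as two hexadecimal digits. The sketch below assumes that halves are rounded upwards, which is consistent with the outputs shown but is not taken from the rgb documentation; the variable names are chosen for illustration.

```r
v <- c(10, 20, 30)          # red, green, blue intensities
max_value <- 100            # corresponds to maxColorValue=100

# Rescale to 0-255 and round to the nearest integer (halves rounded up)
scaled <- floor(255 * v / max_value + 0.5)

# Two upper-case hexadecimal digits per color, prefixed by "#"
by_hand  <- paste0("#", paste(sprintf("%02X", scaled), collapse = ""))
from_rgb <- rgb(10, 20, 30, maxColorValue = 100)

# An alpha value of 50 out of 100 is rescaled the same way
alpha_byte <- floor(255 * 50 / max_value + 0.5)

c(by_hand = by_hand, from_rgb = from_rgb)
```

Setting maxColorValue=255 avoids the rescaling step altogether, so the arguments map directly to the hexadecimal digits.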

hsv works like rgb except it specifies colors using the
hue-saturation-value color principle.

See also: Rule 4.3 shows how to specify color palettes. 


4.3 Set color palettes in R graphics 

Problem: You want to view or set a complete color palette. 

Solution: Colors can be specified in R graphics using character strings
as shown in Rule 4.2. Colors can also be specified through color palettes.
The palette is the vector of colors that is used when the color argument
in a function call is a numeric index.

The palette function views or sets the color palette. If no arguments
are given to palette, then the function lists the current palette. If the
character string "default" is given as argument, then palette will be
set to the default palette, and if a character vector of colors is given,
then the palette function will set the current palette to that vector of
colors.

> palette() # Show the current palette 

[1] "black" "red" "green3" "blue" "cyan" "magenta" 

[7] "yellow" "gray" 

> palette(c("red", "#1A334D", "black", rgb(.1, .8, 0)))

> palette() 

[1] "red" "#1A334D" "black" "#1ACC00" 

> palette("default") # Use the default palette 
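
The link between numeric color indices and the palette can be checked
with a minimal sketch such as the following. The three-color palette
and the variable name old are made up for the illustration.

```r
old <- palette()                       # Remember the current palette
palette(c("red", "#1A334D", "black"))  # Install a three-color palette
palette()[2]                           # The color that col=2 now refers to
plot(1:6, col=1:6, pch=16, cex=3)      # Indices above 3 wrap around
palette(old)                           # Restore the previous palette
```

Saving and restoring the palette in this way avoids surprising color
changes in later plots in the same session.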

Full color palettes can be generated directly, without having to
specify each color separately, by using the RColorBrewer package. The
central function in RColorBrewer is brewer.pal, which makes the color
palettes from ColorBrewer available in R. brewer.pal takes two
arguments: n, the number of different colors in the palette, and name,
a character string that should match one of the available palette
names. The available palette names are listed in the brewer.pal.info
data frame. The output from brewer.pal can be used directly as input to
the relevant plotting function through, for example, the col or bg
arguments. Alternatively, the palette returned by brewer.pal can be
passed to palette to set the global palette.


> library(RColorBrewer)
> head(brewer.pal.info, n=10)
         maxcolors category
BrBG            11      div
PiYG            11      div
PRGn            11      div
PuOr            11      div
RdBu            11      div
RdGy            11      div
RdYlBu          11      div
RdYlGn          11      div
Spectral        11      div
Accent           8     qual


> brewer.pal(n=5, name="Spectral") # 5 colors from Spectral 

[1] "#D7191C" "#FDAE61" "#FFFFBF" "#ABDDA4" "#2B83BA" 

> brewer.pal(n=10, name="Accent") # 10 colors from Accent 

[1] "#7FC97F" "#BEAED4" "#FDC086" "#FFFF99" "#386CB0" "#F0027F" 
[7] "#BF5B17" "#666666" 

Warning message: 

In brewer.pal(n = 10, name = "Accent") : 

n too large, allowed maximum for palette Accent is 8 
Returning the palette you asked for with that many colors 

> # Set colors locally for just one function call 

> plot(1:8, col=brewer.pal(n=8, name="Reds"), pch=16, cex=4) 

> palette(brewer.pal(n=8, name="Greens")) # Set global palette 

> plot(1:8, col=1:8, pch=16, cex=4) # Plot using colors


The examples on the help page for brewer.pal should be run to see
and appreciate the beautiful palettes.

See also: The web page http://colorbrewer.org has more information
on ColorBrewer palettes. R has several built-in functions like
gray, rainbow, terrain.colors, and heat.colors that return
color palettes with prespecified hues.
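
The built-in palette functions can be compared with a short sketch like
the one below, which draws one bar per color. The number of colors, n,
is an arbitrary choice for the illustration.

```r
gray(0:4 / 4)     # Five gray levels from black (0) to white (1)

n <- 8            # Number of colors requested from each palette
barplot(rep(1, n), col=rainbow(n), main="rainbow")
barplot(rep(1, n), col=terrain.colors(n), main="terrain.colors")
barplot(rep(1, n), col=heat.colors(n), main="heat.colors")
```

Each function returns a character vector of colors that can be given
to the col argument or to palette.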
High-level plots


4.4 Create a scatter plot 

Problem: You wish to create a scatter plot to show the relationship 
between two quantitative variables. 

Solution: Scatter plots are essential for data presentation and
exploratory data analysis because they reveal relationships or
associations between two numeric variables.

The versatile plot function produces scatter plots when two numerical 
vectors of the same length are provided as input. Plot labels for the x 
and y axes are set with the xlab and ylab options, respectively, and 
the main argument can be used to set the plot title. The plotting symbol 
and color are set with the pch and col options, respectively. 

The pairs function produces a matrix of scatter plots where all
possible scatter plots between two variables are shown. pairs requires
either a matrix or data frame as input, or a model formula where the
terms in the formula are used to select the variables plotted.

For example, to see all possible scatter plots for the airquality
dataset we use the code below, which first produces a single scatter
plot of ozone against temperature and then plots all possible pairwise
scatter plots. The results are shown in Figure 4.3.

> data(airquality) 

> plot(airquality$Temp, airquality$Ozone, 

+ xlab="Temperature", ylab="Ozone", 

+ main="Airquality relationship?") 

> pairs(airquality) 

The pairs plot shows the relationship between any pair of variables
twice. For example, the upper right plot in the right panel of
Figure 4.3 shows ozone level against day number, while the lower left
scatter plot shows day number against ozone level. However, the user
can customize pairs so that different plots are shown below, on, and
above the diagonal by supplying pairs with the function(s) that
generate the necessary plots. The arguments lower.panel and upper.panel
should be functions that accept parameters x, y, and ..., while the
diagonal function given to the argument diag.panel should accept
parameters x and .... Common to all three functions is that they must
not start a new plot but should plot within a given coordinate system.
Hence, functions such as plot, boxplot, and qqnorm cannot be used
directly as panel functions.

Figure 4.3: Single scatter plot (left) and all possible scatter plots
(right) for the airquality dataset.

In the following code we use three different panel functions to make
the pairs plot seen in Figure 4.4. Histograms and empirical density
curves (computed using the density function) are plotted in the
diagonal with the panel.hist function. The panel.r2 function is a
slight modification of the panel.cor function defined on the pairs
help page, and it calculates and prints the squared correlation, R^2,
with text size depending on the proportion of explained variation.
Finally, we use the built-in function panel.smooth to produce scatter
plots overlaid with a smoothing function to show the general trend.

> # Create function to make histogram with density superimposed
> panel.hist <- function(x, ...) {
+   # Set user coordinates of plotting region
+   usr <- par("usr"); on.exit(par(usr))
+   par(usr = c(usr[1:2], 0, 1.5))
+   par(new=TRUE)                          # Do not start new plot
+   hist(x, prob=TRUE, axes=FALSE, xlab="", ylab="",
+        main="", col="lightgray")
+   lines(density(x, na.rm=TRUE))          # Add density curve
+ }
> # Create function to compute and print R^2
> panel.r2 <- function(x, y, digits=2, cex.cor, ...) {
+   # Set user coordinates of plotting region
+   usr <- par("usr"); on.exit(par(usr))
+   par(usr = c(0, 1, 0, 1))
+   r <- cor(x, y, use="complete.obs")**2  # Compute R^2
+   txt <- format(c(r, 0.123456789), digits=digits)[1]
+   if(missing(cex.cor)) cex.cor <- 1/strwidth(txt)
+   text(0.5, 0.5, txt, cex = cex.cor * r)
+ }
> pairs(~ Ozone + Temp + Wind + Solar.R, data=airquality,
+       lower.panel=panel.smooth, upper.panel=panel.r2,
+       diag.panel=panel.hist)

Figure 4.4: Modified pairs plot for (some of) the variables in the
airquality dataset. Histograms and density curves are shown in the
diagonal. Upper panels show the proportion of explained variation, R^2,
between the two variables and lower panels show scatter plots overlaid
with a smoothing function.

See also: Figure 4.1 shows different plotting options while Rule 4.5 gives 
an example of histograms. 


4.5 Create a histogram 

Problem: You want to summarize the distribution of a quantitative 
dataset graphically. 

Solution: A histogram allows us to graphically summarize the
distribution of a quantitative set of observations; e.g., the center,
spread, and number of modes in the data.

Histograms and relative frequency histograms are both produced with 
the hist function. By default, the hist function automatically groups 
the quantitative data vector into bins of equal width and produces a 
frequency histogram. 

The LakeHuron dataset contains information on the annual measurements
in feet of the level of Lake Huron in the period from 1875 to 1972.
The code below produces a standard frequency histogram as shown in
the left plot of Figure 4.5.

Figure 4.5: Frequency histogram (left plot) and relative frequency
histogram with unequal bin size (right plot).

> data(LakeHuron) 

> hist(LakeHuron) 

The number of bins is controlled by the breaks option to hist. If
breaks is not entered, then R will try to determine a reasonable number
of bins. If breaks is set to a single integer value, the user suggests
the number of bins (R may adjust it slightly to obtain round
breakpoints), and the exact breakpoints are set when breaks is a
vector of breakpoint values (note that the breakpoint values should
cover the entire range of observations).

If the bins are of unequal widths, then hist by default produces a
relative frequency histogram. Setting the freq=FALSE option forces the
hist function to produce relative frequency histograms even when the
bin widths are identical.
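
What a relative frequency histogram contains can be inspected without
drawing anything. The following sketch assumes the plot=FALSE argument
of hist, which returns the computed histogram object instead of
plotting it. The breakpoints are chosen for illustration.

```r
data(LakeHuron)
h <- hist(LakeHuron, breaks=c(574, 577, 578, 581, 585), plot=FALSE)
h$counts                          # Number of observations in each bin
h$density                         # Bar heights when freq=FALSE
sum(h$density * diff(h$breaks))   # The bar areas add up to 1
```

With unequal bin widths it is the area of a bar, not its height, that
is proportional to the number of observations in the bin.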

The col, density, and angle options set the fill color of the histogram
bars and the density and slope of shading lines, respectively. If
density is set to create shading lines, then the col option determines
the color of the shading lines. hist includes a default title. It can be
changed with the main option, and the title is removed altogether if
main=NULL.

The density function computes a kernel density estimate of the data,
and it can be added directly to a relative frequency histogram using
the lines function.

> # Use the breaks option to specify the number of bins
> # regardless of the size of the dataset
> # Remove title and make bars gray
> hist(LakeHuron, xlab="Level of Lake Huron (in feet)",
+      freq=FALSE, breaks=c(574, 577, 578, 581, 585),
+      col="gray", main=NULL)
> lines(density(LakeHuron))


4.6 Make a boxplot 

Problem: You wish to make a box-and-whisker plot to summarize one 
or more distributions graphically. 

Solution: Boxplots are efficient graphics both for examining a single
distribution and for making comparisons between several distributions.
Boxplots provide a visual summary of several aspects of a
distribution, including its central value, general shape, and
variability. A boxplot shows the extremes (minimum and maximum values),
the lower and upper quartiles, and the median.

Horizontal and vertical boxplots are produced in R by the boxplot 
function and input to boxplot can either be a single numeric vector 
(for a single distribution) or a formula of the form y ~ group where 
y is a numeric vector of values and the vector group determines the 
grouping (for parallel boxplots). Parallel boxplots are also created if 
several numeric vectors are used as arguments to boxplot. 

By default, R produces a modified boxplot where extreme values are 
graphed as separate points. Thus, the whiskers extend to the largest 
data points that are not extreme (extreme observations or outliers are 
defined as values that are further than 1.5 x the inter-quartile range 
away from the lower or upper quartile). 
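
The rule for extreme values can be examined numerically with
boxplot.stats, the base R function that computes the statistics used
for drawing a boxplot. The data vector below is made up for the
illustration.

```r
x <- c(1:10, 50)        # Made-up data with one clearly extreme value
s <- boxplot.stats(x)   # Uses the factor 1.5 by default
s$stats                 # Whisker ends, box edges, and median
s$out                   # Values that are drawn as separate points
boxplot(x)
```

Here the box runs from 3.5 to 8.5, so the upper limit for the whisker
is 8.5 + 1.5 * 5 = 16. The whisker therefore stops at 10, and the value
50 is plotted as a separate point.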

To compare the distribution of teeth lengths for different doses of the 
ToothGrowth data we can use the following code, which produces 
the left-hand graph of Figure 4.6. 

> data(ToothGrowth) 

> teeth <- subset(ToothGrowth, supp=="VC") # Use only some data 

> boxplot(len ~ factor(dose), data=teeth) 

We can see that the distribution corresponding to dose 1 appears to
have slightly smaller variance than the other two doses and that there
is a single extreme value among the dose 1 observations. The
distribution of values for all three doses appears to be fairly
symmetric.

The standard boxplot, where the whiskers extend to the minimum and
maximum value, can be obtained by setting the range option to zero.
In addition, the horizontal=TRUE option ensures that the produced
boxplot is horizontal. The following code produces the right-hand plot
of Figure 4.6.

Figure 4.6: Parallel boxplots (left plot) and a single horizontal boxplot where
the whiskers extend to the minimum and maximum value (right plot).

> # Horiz. boxplot with whiskers from minimum to maximum value 

> attach(ToothGrowth) # Use all data 

> boxplot(len, horizontal=TRUE, range=0) 


4.7 Create a bar plot 

Problem: You want to present your categorical data or results as a bar 
plot. 

Solution: Categorical data can be visualized as bar plots, which are 
produced by the barplot function in R. When the first argument to 
barplot is a vector, then the default plot consists of a sequence of bars 
with heights corresponding to the elements of the vector. If the first 
argument is a matrix, then each bar is a segmented bar plot where each 
column of the matrix corresponds to a bar and the matrix values define 
the height of the elements of each stacked bar. 

barplot creates stacked bars by default, but that can be changed to
juxtaposed bars by setting the beside option to TRUE. The legend.text
option constructs a legend for the bar plot. If legend.text is set to a
vector of strings, then these strings will correspond to the legend
labels. Alternatively, legend.text can be set to TRUE, in which case
the row names of the input matrix are used to create the legend labels.
If a vector of character strings is given to the names option, then the
names are plotted underneath each bar or group of bars. Finally, the
horiz argument defaults to FALSE but can be set to TRUE if the bars are
to be drawn horizontally.
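
A minimal sketch of these options is shown below. The matrix counts and
its row and column names are made up for the illustration.

```r
# Made-up 2 x 2 table: rows become legend entries, columns become bars
counts <- matrix(c(3, 5, 2, 4), nrow=2,
                 dimnames=list(c("A", "B"), c("Yes", "No")))

barplot(counts, beside=TRUE, legend.text=TRUE)   # Juxtaposed bars
barplot(counts, horiz=TRUE, legend.text=TRUE)    # Stacked, horizontal
```

Because the matrix has row names, legend.text=TRUE is enough to label
the legend, and the column names are used as bar labels.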

We need to divide each column in the matrix by the column sum if the
stacked bar plot should show relative frequencies. The prop.table
function computes the relative frequencies of a matrix given as the first
argument. The second argument to prop.table determines if the
proportions are relative to the row sums (margin=1) or the column sums
(margin=2).
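
The effect of the margin argument can be verified on a small made-up
matrix, as in the following sketch.

```r
tab <- matrix(c(1, 3, 2, 4), ncol=2)   # Rows are (1, 2) and (3, 4)

by.row <- prop.table(tab, margin=1)    # Each row sums to 1
by.col <- prop.table(tab, margin=2)    # Each column sums to 1

rowSums(by.row)
colSums(by.col)
```

The first row of by.row is 1/3 and 2/3, whereas the first column of
by.col is 1/4 and 3/4.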

As an example, we use the following data from the county of Århus in
Denmark, where the concentration of the pesticide dichlorobenzamide
(also called BAM) in drinking water was examined in two different
municipalities within the county. The allowable limit of BAM is
0.10 µg/l, and the data are summarized below.

                  Concentration (in µg/l)
Municipality   < 0.01   0.01-0.10   > 0.10   Total
Hadsten            23          12        6      41
Hammel             20           5        9      34


The following example produces the two plots shown in Figure 4.7. We 
consider the relative frequencies and use the t function to transpose 
the matrix such that frequencies are grouped by municipality instead 
of concentration. 

> m <- matrix(c(23, 20, 12, 5, 6, 9), ncol=3)   # Input data
> relfrq <- prop.table(m, margin=1)             # Calc rel. freqs
> relfrq
          [,1]      [,2]      [,3]
[1,] 0.5609756 0.2926829 0.1463415
[2,] 0.5882353 0.1470588 0.2647059
> # Make juxtaposed barplot
> barplot(relfrq, beside=TRUE,
+         legend.text=c("Hadsten", "Hammel"),
+         names=c("<0.01", "0.01-0.10", ">0.10"))
> # Stacked relative barplot with labels added
> barplot(t(relfrq), names=c("Hadsten", "Hammel"), horiz=TRUE)


See also: Rule 4.8 shows how to add error bars to bar plots. 
Figure 4.7: Juxtaposed and stacked relative bar plots.


4.8 Create a bar plot with error bars 

Problem: You want to present your data or results as a bar plot
overlaid with error bars.

Solution: The barplot2 function from the gplots package can be
used to create a bar plot with error bars. barplot2 is an enhancement
of the standard barplot function that can plot error bars for each bar.
The function takes the same arguments as barplot, but in addition you
can specify the lower and upper limits for the individual error bars with
the ci.l and ci.u options, respectively.

If we want to create a bar plot of the average monthly temperature and
corresponding intervals where we can expect to find 95% of the
observations of the population (assuming normality and known monthly
temperature means) based on the airquality dataset, then we can
use the following code, where we set the argument plot.ci=TRUE to
make barplot2 plot the intervals.

> library(gplots) 

> data(airquality) 

> # Compute means and standard deviations for each month 

> mean.values <- by(airquality$Temp, airquality$Month, mean) 

> sd.values <- by(airquality$Temp, airquality$Month, sd) 

> barplot2(mean.values,
+          xlab="Month", ylab="Temperature (F)",
+          plot.ci=TRUE,
+          ci.u = mean.values+1.96*sd.values,
+          ci.l = mean.values-1.96*sd.values)

Figure 4.8: Examples of barplot2 output. The left panel shows the expected
intervals for 95% of the observations for each month. The right graph shows
(half) the 95% confidence intervals for the mean of each month.


The code above produces the left-hand graph of Figure 4.8. The line
color, type, and width of the error bars are controlled with the
ci.color, ci.lty, and ci.lwd options, respectively, and they take the
usual values related to line plotting.

We can also use the barplot2 function to show the standard errors of
parameter estimates as shown in the following code. The output is the
right-hand graph of Figure 4.8, and by setting the lower confidence
interval limit to the same value as the mean value we print only the
part of the confidence intervals that is above the bars. We use coef
and summary to extract the estimates and standard errors and the qt
function to calculate 95% confidence limits directly.

> airquality$Month <- factor(airquality$Month) 

> model <- lm(Temp ~ Month - 1, data=airquality) 

> mean.values <- coef(summary(model))[,1] 

> mean.values 

Month5 Month6 Month7 Month8 Month9 
65.54839 79.10000 83.90323 83.96774 76.90000 

> sem <- coef(summary(model))[,2] 

> scale <- qt(0.975, df=summary(model)$df[2]) 

> barplot2(mean.values,
+          xlab="Month", ylab="Temperature (F)",
+          plot.ci=TRUE,
+          ci.u = mean.values+scale*sem,
+          ci.l = mean.values)
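
As a cross-check of the hand-computed limits, the sketch below compares
them with the output of confint, the base R function that returns
confidence intervals for the parameters of a fitted model. The object
name manual is made up for the illustration.

```r
data(airquality)
model <- lm(Temp ~ factor(Month) - 1, data=airquality)

est   <- coef(summary(model))[, 1]           # Monthly means
sem   <- coef(summary(model))[, 2]           # Their standard errors
scale <- qt(0.975, df=df.residual(model))    # t quantile for 95% limits

# Lower and upper limits computed by hand
manual <- cbind(est - scale*sem, est + scale*sem)

# Compare with confint, ignoring row and column names
all.equal(unname(manual), unname(confint(model)))
```

If the comparison returns TRUE, then either approach can be used to
obtain the values for the ci.l and ci.u options.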


See also: The plotCI function in the gplots package. plotCI is used
in Rule 4.9.


4.9 Create a plot with estimates and confidence intervals

Problem: You want to present the results from your analysis graphi¬ 
cally with estimates and corresponding confidence intervals. 

Solution: Sometimes it is desirable to present the results from an
analysis graphically, for example by illustrating parameter estimates
and their corresponding confidence intervals. These graphs can be
created using standard plotting functions like plot and arrows, but the
plotCI function from the gplots package combines the necessary steps
into a single function.
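
For comparison, a base R version built from plot and arrows could look
like the following sketch. The estimates and interval half-widths are
hypothetical numbers used only to make the example self-contained.

```r
est <- c(4, 7, 5)        # Hypothetical parameter estimates
hw  <- c(1, 2, 1.5)      # Hypothetical half-widths of the intervals
idx <- seq_along(est)

plot(idx, est, pch=16, ylim=range(est - hw, est + hw),
     xlab="Parameter", ylab="Estimate")
# angle=90 and code=3 draw flat caps at both ends of each segment
arrows(idx, est - hw, idx, est + hw, angle=90, code=3, length=0.05)
```

Setting ylim explicitly is necessary here because plot only sees the
point estimates and would otherwise cut off the longest intervals.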

plotCI requires one or two arguments, x and y, which define the
coordinates for the center of the error bars. If only one vector is
supplied, its values are plotted against the index vector 1:n. The uiw
and liw options set the widths of the upper/right and lower/left error
bar, respectively, and if liw is not specified it defaults to the same
value as uiw.

The OrchardSprays data frame contains the results from an experiment
undertaken to assess the potency of various constituents of orchard
sprays in repelling honeybees. We use a one-way analysis of
variance to compare the treatment means, and we fit the model without
intercept so we directly obtain the group means and standard errors
from R.

The output from plotCI can be seen in Figure 4.9. In the following
code we suppress the printing of the x-axis when calling plotCI since
we would like to add the actual coefficient names directly to the graph.
Hence we extract the names of the coefficients from the fitted object and
add them manually using the axis function.

> library(gplots)
> data(OrchardSprays)
> fit <- lm(decrease ~ treatment - 1, data=OrchardSprays)
> summary(fit)

Call:
lm(formula = decrease ~ treatment - 1, data = OrchardSprays)

Residuals:
    Min      1Q  Median      3Q     Max
-49.000  -9.500  -1.625   3.812  58.750

Coefficients:
           Estimate Std. Error t value Pr(>|t|)
treatmentA    4.625      7.253   0.638 0.526308
treatmentB    7.625      7.253   1.051 0.297663
treatmentC   25.250      7.253   3.481 0.000975 ***
treatmentD   35.000      7.253   4.825 1.12e-05 ***
treatmentE   63.125      7.253   8.703 5.47e-12 ***
treatmentF   69.000      7.253   9.513 2.71e-13 ***
treatmentG   68.500      7.253   9.444 3.49e-13 ***
treatmentH   90.250      7.253  12.443  < 2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.52 on 56 degrees of freedom
Multiple R-squared: 0.8887, Adjusted R-squared: 0.8728
F-statistic: 55.89 on 8 and 56 DF, p-value: < 2.2e-16

> plotCI(coef(summary(fit))[,1],
+        uiw=qt(.975, df=56)*coef(summary(fit))[,2],
+        ylab="Estimates and 95% confidence intervals", xaxt="n")
> axis(1, 1:length(rownames(coef(summary(fit)))),
+      rownames(coef(summary(fit))))

Figure 4.9: Example of plotCI output.

See also: The plotmeans function from the gplots package provides 
a wrapper to plotCI to plot group means and confidence intervals. 


4.10 Create a pyramid plot 

Problem: You want to create a pyramid (opposed horizontal bar) plot 
to display two groups opposite each other. 

Solution: The pyramid.plot function from the plotrix package
can plot a pyramid plot (opposed horizontal bar plot) typically used to
illustrate population pyramids.

pyramid.plot takes two arguments, lx and rx, each of which should
be either a vector, matrix, or data frame of the same length. The values
contained in the lx and rx arguments determine the bar lengths of the
left and right part of the plot, respectively. If lx and rx are both
matrices or data frames, then pyramid.plot produces opposed stacked bar
plots with the first column of the matrix/data frame plotted innermost
for each bar.

The labels option sets the labels for the categories represented by
each pair of bars, and it should be a vector containing a character string
for each lx or rx value, even if empty. Different labels for the left and
right bars can be used if labels is a matrix or data frame (in which
case the first two columns are used for the labels). The options lxcol
and rxcol can be used to set the colors of the left and right bars,
respectively. If a vector of colors is provided, then pyramid.plot will
cycle through them from the bottom of the pyramid and upwards.

In this example we plot a population pyramid using 10-year age
intervals for the Danish population based on the Danish registry on
January 1, 2011. The numbers of men and women are both normalized to
get the percentage of men (and women) in each age group. The output is
shown in the left plot of Figure 4.10.


> library(plotrix)
> men <- c(334700, 357834, 328545, 367201, 411689,
+          359363, 336841, 178422, 72296, 9552, 139, 0)
> women <- c(318662, 340357, 320462, 365180, 401521,
+            357405, 345884, 208057, 117843, 27914, 761, 0)
> groups <- paste(seq(0, 110, 10), seq(9, 119, 10), sep="-")
> groups                           # Show group labels
 [1] "0-9"     "10-19"   "20-29"   "30-39"   "40-49"   "50-59"
 [7] "60-69"   "70-79"   "80-89"   "90-99"   "100-109" "110-119"
> men <- 100*men/sum(men)          # Normalize men
> women <- 100*women/sum(women)    # and women
> pyramid.plot(men, women, labels=groups,
+              lxcol="lightgray", rxcol="lightgray")


The default options for pyramid.plot are set up to create population
pyramids. However, the function can also be used for other opposed
bar plots with a little tweaking of the optional arguments. The gap
option sets the width between the opposing bars, and it may need to
be increased if the labels are wider than the age groups typically used
for population pyramids. The unit option sets the label on the x axis
(default is "%"). Finally, the top.labels option is a character vector
of length three that sets the top left, middle, and right labels,
respectively. Its default values are "Male", "Age", and "Female".

Below we use the survey data frame from the MASS package to compare
the male and female student frequencies of exercise for different
groups of smoking status. The output is seen in the right plot of
Figure 4.10.

Figure 4.10: Examples of pyramid.plot output.


> library(MASS) 

> data(survey) 

> # Tabulate exercise and smoking status by gender 

> result <- by(survey, survey$Sex, 

+ function(x) {table(x$Smoke, x$Exer)}) 

> result
survey$Sex: Female

        Freq None Some
  Heavy    3    0    2
  Never   39   10   50
  Occas    5    1    3
  Regul    2    0    3

survey$Sex: Male

        Freq None Some
  Heavy    4    1    1
  Never   47    8   34
  Occas    7    2    1
  Regul    7    1    4

> pyramid.plot(lx=result$Male, rx=result$Female,
+              labels=row.names(result$Male),
+              gap=10, unit="Number of students",
+              top.labels=c("Male", "Smoking Frequency", "Female"),
+              lxcol=c("darkgray", "white", "lightgray"),
+              rxcol=c("darkgray", "white", "lightgray"))
Figure 4.11: Example of matplot output.


4.11 Plot multiple series 

Problem: You want to create a graph where multiple series (points or 
lines or both) are plotted. 

Solution: The matplot function can be viewed as an extension of the
plot function which can plot multiple series of points or lines
simultaneously. matplot takes two arguments, x and y, each of which can
be a vector or a matrix with the same number of rows.

The first column of x is plotted against the first column of y, the second 
column of x against the second column of y, etc. matplot will cycle 
through the columns if one matrix has fewer columns than the other. 
The left panel of Figure 4.11 is created by the following two lines: 

> x <- seq(0, 4*pi, .1) 

> matplot(x, cbind(sin(x), cos(x), sin(x/2)), type="l")

matplot takes the same graphical parameters as plot. In particular,
type can be used to set the plot type (type="l" for lines and
type="p" for points) and the parameters pch, lty, and lwd are used
to set the plotting symbol, line type, and line width, respectively. As
always, these parameters are recycled if their length is shorter than the
number of series plotted.

Missing values are not plotted and can be used to create series with a 
different number of observations as shown below and in the right-hand 
panel of Figure 4.11. 

> x1 <- 1:6
> y1 <- x1**2
> x2 <- seq(1.5, 4.5, 1)
> x2
[1] 1.5 2.5 3.5 4.5
> x2fix <- c(x2, NA, NA)
> y2 <- sqrt(x2fix)
> matplot(cbind(x1, x2fix), cbind(y1, y2))

If only one of x and y is specified, then x will be set to the vector
1:n, and the other vector will act as y.

See also: matlines and matpoints are matrix plot versions of the
lines and points functions, respectively. matlines adds lines to an
existing plot while matpoints adds points.
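
A minimal sketch of adding several series to an existing plot with
matlines and matpoints is given below. The choice of curves, line
types, and the variable name xs are made up for the illustration.

```r
x <- seq(0, 4*pi, .1)
plot(x, sin(x), type="l", ylab="Value")       # Start with one curve

# Add two more curves, one column per series
matlines(x, cbind(cos(x), sin(x/2)), lty=2:3, col="black")

# Mark every 10th grid point on the first two curves
xs <- x[seq(1, length(x), by=10)]
matpoints(xs, cbind(sin(xs), cos(xs)), pch=1:2, col="black")
```

Since neither function rescales the axes, the initial call to plot
should set axis limits that are wide enough for all the series.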


4.12 Make a 2D surface plot 

Problem: You want to create a 2D contour plot of a surface. 

Solution: R has several built-in functions that produce two-dimensional
plots of a surface. The contour function displays isolines of a matrix
z, where the elements of z are interpreted as heights with respect to
the xy plane.

The following code makes a contour plot for the function sin(x · y)
evaluated over the rectangle [0,3] x [0,4]. The result is shown in the
upper left panel of Figure 4.12.

> x <- seq(0, 3, .2) # Set grid points for x 

> y <- seq(0, 4, .1) # Set grid points for y 

> # Compute function at all grid points 

> z <- outer(x, y, FUN=function(xx,yy) { sin(xx*yy) }) 

> contour(x, y, z) # Make contour plot 

The locations of the grid lines given by the x and y parameters must
be in ascending order, and they are set to be equidistant on the interval
from 0 to 1 if they are not specified. The z matrix is interpreted as a
table of f(xi,yj) values, so the x and y axes in the plot produced by
contour correspond to the rows and columns of z, respectively.
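
The correspondence between the grid vectors and the z matrix can be
checked directly, as in this short sketch. The indices 4 and 11 are
picked arbitrarily.

```r
x <- seq(0, 3, .2)
y <- seq(0, 4, .1)
z <- outer(x, y, FUN=function(xx, yy) { sin(xx*yy) })

dim(z)                           # One row per x value, one column per y value
c(length(x), length(y))

# Element [i, j] of z holds the function value at (x[i], y[j])
z[4, 11] == sin(x[4] * y[11])
```

If a matrix has been stored the other way around, it can be transposed
with the t function before it is passed to contour.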

The number of contour isoline levels is set by the nlevels option
(default is 10), which produces equidistant isolines. Alternatively,
specific isolines can be set by supplying a vector of desired isolines
to the levels option. This is shown below and the result is in the
upper right panel of Figure 4.12.

> contour(x, y, z, levels=c(-1, -.3, 0, .1, .2, .3, .8, 1))

Figure 4.12: Examples of contour, filled.contour, and image output. The
upper panels show contour, the lower left plot is filled.contour, and the
lower right panel is image.


Plain contours are not very striking, and often the surface is more
easily envisioned if colors are used to fill the spaces between the
contours. The filled.contour function creates a contour plot where the
areas are filled in solid color and adds a color map that shows the
relationship between the colors and the isolines. This type of plot is
sometimes also called a "level plot." The color.palette option to
filled.contour can set the color scheme.

Finally, you can use the image function to plot your matrix without it
being contoured. image needs slightly different input than contour
and filled.contour. If the length of x equals nrow(z)+1, then the
values of x are interpreted as the boundaries between the cells.
Alternatively, if the length of x equals nrow(z), then the values of x
specify the midpoints of the cells. The same applies to y. Also, for
image the option to set the color scheme is col, so the following code
produces the output shown in the lower plots of Figure 4.12.

> filled.contour(x,y,z,color.palette=gray.colors) 

> image(x,y,z,col=gray.colors(10) ) 
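
The alternative input form for image, where x and y give cell
boundaries rather than midpoints, can be sketched as follows. The
names xb and yb are made up, and the boundaries are placed half a grid
step on either side of each grid point.

```r
x <- seq(0, 3, .2)
y <- seq(0, 4, .1)
z <- outer(x, y, FUN=function(xx, yy) { sin(xx*yy) })

# Shift each grid point down by half a step and append a final edge
xb <- c(x - 0.1,  max(x) + 0.1)     # nrow(z) + 1 boundaries
yb <- c(y - 0.05, max(y) + 0.05)    # ncol(z) + 1 boundaries

image(xb, yb, z, col=gray.colors(10))
```

With an equidistant grid this gives the same picture as supplying the
midpoints, but the boundary form also allows cells of unequal size.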

See also: The contourplot function from the lattice package uses
formula notation to specify contour plots, i.e., z ~ x*y. The functions
rainbow, terrain.colors, heat.colors, and cm.colors are used to create
color palettes with different color schemes.


4.13 Make a 3D surface plot 

Problem: You want to create a 3D graph of a surface. 

Solution: The persp function can be used to draw perspective plots
of surfaces represented by a matrix z, where the elements of z are
interpreted as "heights" with respect to the xy plane.

The following code creates a perspective plot for the function
sin(x · y) evaluated over the rectangle [0,3] x [0,4]. The result is
shown in the upper left panel of Figure 4.13.

> x <- seq(0, 3, .2) # Set grid for x dimension 

> y <- seq(0, 4, .1) # Set grid for y dimension 

> z <- outer(x, y, FUN=function(xx,yy) { sin (xx*yy) }) 

> persp(x, y, z)

The function arguments theta and phi are used to rotate the viewing
angle of the surface in the azimuthal and colatitude directions,
respectively. The plot can often be substantially improved by setting a
reasonable view point using these two options. The z matrix is
interpreted as a table of f(xi,yj) values, so the x and y axes in the
plot correspond to the rows and columns of z, respectively. Thus, with
the default values of phi and theta, the top left corner of the matrix
is displayed at the left-hand side, closest to the user.

Three additional arguments are worth mentioning here: col, shade,
and ticktype, which set the surface color, the degree of shading, and
the tick marks along the axes, respectively. col is the color of the
surface facets, and can be set to a single color (which is recycled) or
a longer vector of colors. shade should be a number between 0 and 1
which sets the level of shading, and ticktype can be set to the
character string "detailed" to provide tick marks along the axes. The
default for ticktype is "simple", which just draws arrows parallel
to the axes.

Figure 4.13: Examples of persp output. The upper right graph rotates
the plot and sets the color to light gray. The lower left graph adds
shading to the plot and the lower right provides tick marks.

These options are used below and the output is seen in Figure 4.13. 


> persp(x, y, z, theta=40, phi=45, col="lightgray")
> persp(x, y, z, theta=40, phi=45, col="lightgray", shade=.6)
> persp(x, y, z, theta=40, phi=45, col="lightgray", shade=.6,
+       ticktype="detailed")

See also: The help page of persp gives an example of how to make the 
surface colors correspond to the values of the z matrix. The wireframe 
function from the lattice package can also produce surface plots. 
Rule 4.27 uses the persp3d function from the rgl package to create 
interactive 3D surface plots. 
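
A sketch in the spirit of the persp help page example is shown below; it is an adaptation, not a copy of that example. The names z.facet, n.shades and facet.col are ours, and the choice of 20 gray shades is arbitrary.

```r
x <- seq(0, 3, .2)
y <- seq(0, 4, .1)
z <- outer(x, y, function(xx, yy) sin(xx * yy))

# persp colors facets, not grid points, so average the four
# corner heights of every facet: (nrow - 1) x (ncol - 1) values
nr <- nrow(z)
nc <- ncol(z)
z.facet <- (z[-1, -1] + z[-1, -nc] + z[-nr, -1] + z[-nr, -nc]) / 4

# Cut the facet heights into intervals and map them to gray shades
n.shades <- 20
shades <- gray.colors(n.shades)
facet.col <- shades[cut(z.facet, n.shades)]

persp(x, y, z, col = facet.col, theta = 40, phi = 45)
```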


4.14 Plot a 3D scatter plot

Problem: You want to make a 3D scatter plot of points. 

Solution: Visualizing a three-dimensional point cloud in just two dimensions
can be difficult but a few functions exist that try to accomplish
that. Here we will look at the scatterplot3d function from the
scatterplot3d package for plotting three-dimensional scatter plots.
scatterplot3d works like plot but takes three arguments, x, y and
z, which define the three coordinates of the points. The y and z options
can be omitted if x is a data frame with numeric variables cols, rows,
and value or if x is a data frame that contains the variables x, y and z.
The type option sets the type of plot: p for points, h for vertical lines
and l for lines. The vertical lines are especially useful for indicating the
"depth" of a three-dimensional point since both the "height" and the
position on the x-y plane are shown.

The color of the points/lines can be given by the color option. Alternatively,
the highlight.3d option can be set to TRUE to override the
color option, which makes sure that the points are drawn in different colors
depending on their depth (i.e., the value of the y coordinate).

The relationship between volume, height and girth for the trees data
frame is plotted by the following code using the scatterplot3d function.
The output is shown as the upper two plots of Figure 4.14.

> library(scatterplot3d) 

> data(trees) 

> attach(trees) 

> scatterplot3d(Girth, Height, Volume) # Scatter plot 

> scatterplot3d(Girth, Height, Volume, type="h", # Vertical lines 

+ color=grey(31:1/40)) 

The scale.y and angle options set the depth of the plotting box (relative
to the width and height), and the angle between the x and y axis,
respectively. Default values are 1 and 40.

The object returned by scatterplot3d provides the two functions,
points3d and plane3d, which are used to add graphics to the current
3D plot. points3d adds points or lines (much like the points
and lines functions) while plane3d draws a plane into the existing
plot. Both functions include information on the graphical setup and
plot dimensions to ensure that the new graphical elements are correctly
added. points3d works just like points and requires three numeric
vectors as input. plane3d either needs options Intercept, x.coef,
and y.coef for the three parameters that define the plane in 3D, or the
resulting object from a lm call with two explanatory variables. The first
explanatory variable is taken to be the x axis while the other variable is
the y axis.

Figure 4.14: Examples of scatterplot3d output for the trees data. The
upper left plot is the default 3D plot, while the upper right plot shows the use
of vertical lines (type="h"). The lower left plot shows how scale.y and
angle change the plot, and the lower right plot adds a 3D plane.

> # Order trees according to volume and then 

> # plot curve from lowest to highest volume 

> o <- order(trees$Volume) 

> scatterplot3d(Girth[o], Height[o], Volume[o], type="l", 

+ scale.y=3, angle=75) 

> myplot <- scatterplot3d(Girth, Height, Volume, type="h", 

+ angle=60) 

> model <- lm(Volume ~ Girth + Height) # Fit a model 

> myplot$plane3d(model) # Add fitted plane 

> detach(trees) 

The code produces the two lower plots shown in Figure 4.14. 
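
The plane can also be given by its three parameters instead of the lm object. The sketch below assumes the scatterplot3d package is installed (the plot is skipped otherwise) and uses data=trees in place of attach; the name b is ours.

```r
# Fit the plane with base R; attach() is not needed when data= is used
model <- lm(Volume ~ Girth + Height, data = trees)
b <- coef(model)   # named vector: (Intercept), Girth, Height

if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  myplot <- scatterplot3d::scatterplot3d(trees$Girth, trees$Height,
                                         trees$Volume, type = "h",
                                         angle = 60)
  # Girth is on the x axis and Height is on the y axis
  myplot$plane3d(Intercept = b[["(Intercept)"]],
                 x.coef = b[["Girth"]], y.coef = b[["Height"]])
}
```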

See also: The plot3d function from the rgl package produces interactive 3D
plots (see Rule 4.27).


4.15 Create a heat map plot 

Problem: You wish to create a heat map plot to visualize data from a 
two-dimensional array. 

Solution: Heat map plots are used to visualize data from a two-dimensional
array using colors to indicate the values in the array. In R, the
function heatmap plots a heat map, and it requires a numeric matrix as
its first argument. The distfun argument determines which function
is used to calculate distances between the row vectors and it defaults
to the function dist. The hclustfun argument defaults to the function
hclust and determines which function is used to compute the hierarchical
clustering of the rows and columns. The options Rowv and Colv can be set to
NA to suppress printing of row and column dendrograms, respectively.
The scale option indicates if the matrix values should be scaled. Its
default value is "row" unless the matrix is declared symmetric with
symm=TRUE, in which case it is "none". These correspond to values being
centered and scaled in the row direction or not at all, respectively.
If scale="column" then the values are centered and scaled in the column
direction.
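
To make the row scaling concrete, the sketch below centers and scales the rows of a matrix by hand. It only illustrates the idea described above, and the name x.row is ours; the last line turns scaling off so the colors show the raw values.

```r
x <- as.matrix(eurodist)

# Center and scale each row: scale() works on columns, so transpose
x.row <- t(scale(t(x)))

# Every row now has mean 0 and standard deviation 1
range(rowMeans(x.row))
range(apply(x.row, 1, sd))

# Draw the unscaled distances; Rowv = NA and Colv = NA keep the order
heatmap(x, Rowv = NA, Colv = NA, scale = "none")
```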

The eurodist dataset contains distances between major European
cities (stored as a dist object, hence the call to as.matrix) and the
distances are used below to illustrate the heatmap function.
The output is seen in the left panel of Figure 4.15.

> data(eurodist) 

> x <- as.matrix(eurodist) 

> heatmap(x, Rowv=NA, Colv=NA) 

Notice how the row and column names of the matrix x are automatically
included in the plot to make it easier to identify the individual
rows and columns.

The color scheme of the heat map plot can be set with the col option.
By default it uses heat colors obtained from the heat.colors function,
but that can be changed to give other color schemes. If the options
ColSideColors and RowSideColors are provided, they should contain
a vector of color names for a horizontal/vertical side bar that annotates
the columns and rows of the data matrix, respectively. This can
be used to indicate known groupings by color code as shown below.

In the following, we would like to add side bars that indicate which
cities are from the same countries. We do that by providing a vector
of colors corresponding to the countries for the individual city names.
The output is seen in the right-hand graph of Figure 4.15.

Figure 4.15: Examples of heatmap output. The right-hand plot has added vertical
and horizontal side bars.


> colnames(x)
 [1] "Athens"          "Barcelona"       "Brussels"
 [4] "Calais"          "Cherbourg"       "Cologne"
 [7] "Copenhagen"      "Geneva"          "Gibraltar"
[10] "Hamburg"         "Hook of Holland" "Lisbon"
[13] "Lyons"           "Madrid"          "Marseilles"
[16] "Milan"           "Munich"          "Paris"
[19] "Rome"            "Stockholm"       "Vienna"
> # Add a color bar to the heatmap


> countries <- c("Greece", "Spain", "Belgium", "France", 

+ "France", "Germany", "Denmark", "Switzerland", "Gibraltar", 

+ "Germany", "Holland", "Portugal", "France", "Spain", "France", 

+ "Italy", "Germany", "France", "Italy", "Sweden", "Austria") 

> f.countries <- factor(countries) 

> group.colors <- heat.colors(nlevels(f.countries)) 

> country.color <- group.colors[as.integer(f.countries)] 

> heatmap(x, ColSideColors = country.color, 

+ RowSideColors = country.color) 

See also: The heatmap.2 function from the gplots package extends
the standard heatmap function by adding several extra options including
the possibility to see a color key for the values in the heatmap. The
functions rainbow, terrain.colors, and cm.colors are some of
the possible choices for color schemes. For red/green heat maps that
are commonly used to plot microarray expression data you can use the
redgreen function from the gplots package to generate the color
scheme.


4.16 Plot a correlation matrix

Problem: You have a covariance matrix or a matrix of correlations and 
you want to visualize it graphically. 

Solution: A correlation plot displays correlations between the variables
in a dataset or between the estimates from a fitted model. Correlations
can be depicted as ellipses and the degree of correlation is shown
by the shape and/or color of the ellipses. Each variable is perfectly
correlated with itself (so the diagonal of the correlation plot will always
consist of straight lines), and perfect circles indicate no correlation.

The plotcorr function from the ellipse package plots a correlation
matrix using ellipses for each entry in a correlation matrix. plotcorr
expects a matrix with values between -1 and 1 as its first argument. If
the numbers argument is set to TRUE then the actual correlation coefficients
are rounded (and multiplied by 10) and printed instead of
ellipses. The type option determines if the full (the default), lower
or upper triangular matrix is plotted. The diag=FALSE option suppresses
printing of the diagonal elements (when type="full") while
setting diag=TRUE adds the diagonal elements (when type="upper"
or when type="lower").

Below, we plot the correlations between the different variables in the
airquality data. The second example overlays two correlation plots
so the actual correlation coefficients are printed in the upper triangular
matrix and the ellipses are printed in the lower triangular matrix. Here,
the vcov function is used in combination with cov2cor to extract the
variance-covariance matrix of the estimated parameters from a linear
model and convert it to a correlation matrix.

> library(ellipse) 

> data(airquality) 

> # cor only works on complete cases so we remove missing 

> complete <- airquality[complete.cases(airquality), ] 

> plotcorr(cor(complete)) 

> # Get lung capacity data and fit a model 

> library(isdals) 

> data(fev) 

> model <- lm(FEV ~ Ht + I(Ht^2) + Gender + Smoke + Age,

+ data=fev)

> corr2 <- cov2cor(vcov(model)) # Correlation matrix 

> plotcorr(corr2, type="lower", diag=TRUE) 

> par(new=TRUE) # Keep 1st plot 

> plotcorr(corr2, type="upper", diag=TRUE, numbers=TRUE) 

The left plot of Figure 4.16 shows the correlation between the measurements
and shows that, for example, wind and ozone are negatively correlated,
while temperature and ozone are positively correlated. In the
right plot of Figure 4.16 we see that all variables but height are uncorrelated.
The strong negative correlation between height and height
squared and the intercept is caused by collinearity in the data.

Figure 4.16: Correlation plot for airquality measurements (left graph) and
for parameter estimates from the fev dataset.
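
The conversion done by cov2cor can be written out by hand: entry (i, j) of the variance-covariance matrix is divided by the product of the standard errors of estimates i and j. The sketch below uses the built-in trees data instead of fev so that it runs without the isdals package; the names V, se and corr.manual are ours.

```r
model <- lm(Volume ~ Girth + Height, data = trees)
V <- vcov(model)            # variance-covariance matrix of the estimates

# Divide entry (i, j) by the product of standard errors i and j
se <- sqrt(diag(V))
corr.manual <- V / outer(se, se)

# Compare with the built-in conversion
all.equal(corr.manual, cov2cor(V))
```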

See also: Color can be added to the ellipses to help differentiate between 
the magnitudes of correlation. This is done with the col option. See the 
help page for an example of how to add colors. 


4.17 Make a quantile-quantile plot 

Problem: You wish to make a quantile-quantile plot (q-q plot) to graphically
compare the distribution of two datasets or to compare the distribution
of one dataset to a known distribution.

Solution: Quantile-quantile plots (q-q plots) are used to compare the 
distribution function of two datasets by plotting the quantiles of one 
dataset against the quantiles of the other dataset. Shifts in location, 
scale, or changes in symmetry can be seen from the q-q plot, and the 
points on the q-q plot should fall approximately along a 45-degree line 
if the two datasets come from the same distributions. 

The q-q plot is often used for model validation to check if the sample
observations come from a specific distribution by comparing the quantiles
of the dataset with the theoretical quantiles from the hypothesized
distribution. In particular, checking for normality is done by comparing
the sample quantiles to the expected theoretical quantiles from a
standardized normal distribution. If the points in the normal q-q plot
are on a straight line, then the sample observations appear to be from
a normal distribution with a standard deviation corresponding to the
slope of the points, and a mean that corresponds to the intercept of the
points.

Figure 4.17: Normal quantile-quantile plots for data from a normal distribution
(left) and from a t distribution (right).

The qqplot function compares the distribution of two data samples
and requires two quantitative vectors as input arguments, one for
each dataset. The two vectors do not have to have the same length.
The normal q-q plot is implemented by the qqnorm function which requires
a single numeric vector of quantitative observations as input.
The qqline function adds a comparison line that passes through the first and
third quartiles to a normal q-q plot, and it requires the same numeric
vector of observations as input as was used for generating the q-q plot
with qqnorm.
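
A small sketch of qqplot for two samples of different length is given below. The simulated samples a and b, the seed, and the dashed 45-degree reference line added with abline are our own choices for illustration.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
a <- rnorm(200)                    # sample 1
b <- rnorm(50, mean = 1, sd = 2)   # sample 2: shifted and more spread out

qq <- qqplot(a, b, xlab = "Quantiles of a", ylab = "Quantiles of b")
abline(0, 1, lty = 2)              # 45-degree reference line

# qqplot also returns the plotted coordinates
str(qq)
```

Since b is shifted and more spread out than a, the points should not follow the dashed line.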

If we observe substantial departures from a straight line (either in
the tails, the middle or both) then we conclude that the fitted normal
model is not appropriate for the data. An "inverse S" shape of the
points indicates that the sample distribution has heavier tails than the
normal distribution used for comparison while an "S" shape suggests
lighter tails. If both tails are above the comparison line, then this is an
indication that the observations are skewed to the right. If both tails are
below the comparison line, then the distribution of the observations is
left-skewed.




We illustrate the use of normal q-q plots below for two situations: when 
data are from the normal distribution and from the t distribution (which 
looks like the standardized normal distribution but has heavier tails). 
The result is shown in Figure 4.17. 


> x <- rnorm(400, mean=5, sd=2)   # Normal data
> y <- rt(400, df=4)              # Data from t distribution
> qqnorm(x)                       # Create normal q-q plot
> qqline(x)                       # Add comparison line
> qqnorm(y)                       # Create normal q-q plot
> qqline(y)                       # Add comparison line



The points in the left plot of Figure 4.17 follow a straight line so nothing
suggests that they do not come from a normal distribution (roughly
with a mean of 5 and a standard deviation of 2 as seen by the intercept
and slope of the line). The right plot in Figure 4.17 suggests that the
observations come from a distribution with heavier tails than the normal
distribution since the points roughly follow an "inverse S" shape.
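
The intercept and slope of the comparison line can be computed by hand from the first and third quartiles. This is a sketch under the assumption that qqline is used with its default settings; the seed and the names qs and qt are ours.

```r
set.seed(1)                         # arbitrary seed, for reproducibility
x <- rnorm(400, mean = 5, sd = 2)

# Line through the first and third quartiles
probs <- c(0.25, 0.75)
qs <- quantile(x, probs)            # sample quartiles
qt <- qnorm(probs)                  # theoretical quartiles
slope <- unname(diff(qs) / diff(qt))
intercept <- unname(qs[1] - slope * qt[1])

qqnorm(x)
abline(intercept, slope, lty = 2)   # should coincide with qqline(x)
c(intercept = intercept, slope = slope)
```

The two numbers should be close to the mean and standard deviation used in the simulation.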


4.18 Graphical model validation for linear models 

Problem: You wish to validate a linear normal model graphically to 
verify if the underlying assumptions appear to be fulfilled. 

Solution: Linear normal models play a major role in statistical analysis
and model validation is an important step for all analyses. Here we will
focus on graphical methods for model validation since they illustrate a
broad range of aspects of the relationship between the model and the
data. Diagnostic plots provide a way to check for heteroscedasticity,
normality and influential observations.

Linear models are fitted in R using the lm function and the generic
plot function (which calls the plot.lm function for an lm object) provides
several standard plots for model validation.

The which option determines which model diagnostics plots are produced.
There are six possible plots: (1) a plot of residuals against fitted
values, (2) a normal q-q plot, (3) a scale-location plot of sqrt(|residuals|)
against fitted values, (4) a plot of Cook's distances versus row labels,
(5) a plot of residuals against leverages, and (6) a plot of Cook's distances
against leverage/(1-leverage). By default plots 1, 2, 3, and 5 are produced.

We use the cherry tree data, trees, to illustrate the use of plot.lm for
graphical model validation. The output is seen in Figure 4.18.

> data(trees)

> result <- lm(Volume ~ Height + Girth, data=trees)

> plot(result, which=1:6)

The six plots produced by plot can be used as follows: 

Residuals vs. Fitted. The residuals should be randomly and uniformly
scattered about zero and they should be independent of the fitted
(predicted) values which can be checked on the "Residuals vs. Fitted"
plot. A lack of uniform scatter around zero suggests
that there is still some structure left in the data that is not captured
by the model and that a more complex model and/or a transformation
of some of the variables may be relevant. The solid line
shows a smoothed average level of the residuals.

Normal Q-Q. The normal quantile-quantile plot compares the distribution
of the standardized residuals with the expected theoretical
quantiles from a standardized normal distribution, and provides
information on whether it is realistic that the residuals come from a
normal distribution. See Rule 4.17 for details about the q-q plot.

Scale-Location. The scale-location plot can be used to check for variance
homogeneity and in particular to see if the variance is changing
with the mean fitted value. The size of the square root of the
absolute residuals should be approximately constant if there is
variance homogeneity, and the solid line indicates a smoothed average
of the square root of the absolute standardized residuals.

Cook's Distance. Cook's distance is a measure of the influence of a single
observation and can be used to identify observations that are
highly influential and/or may be potential outliers. Cook's distance
measures how much the fitted values change when each observation
is left out of the analysis; substantial changes to the fitted
values mean that the fit is dominated by the observation. The
"Cook's distance" plot graphs Cook's distance values for each observation.
As a rough rule-of-thumb any observation with a value
of 1 or more is particularly worth checking for validity as is any
observation with a Cook's distance substantially larger than other
observations.

Residuals vs. Leverage. Observations with both high leverage and large
residuals can have a huge impact on the fitted model and the corresponding
estimates. The "Residuals vs. Leverage" plot can be
used to identify potential outliers, observations with high leverage,
or a combination of both. Ideally, the standardized residuals
should be scattered around zero when compared to the leverage.
The smoothed average is indicated by the solid line and contour
levels of Cook's distance are shown by dashed lines.

The "Residuals vs. Leverage" plot contains the same information
as the Cook's distance plot, but at the same time enables the
viewer to determine if influential points are caused by large residuals,
large leverages, or both.

Cook's Distance vs. Leverage. In the "Cook's distance vs. Leverage"
plot the "Residuals vs. Leverage" plot has been turned around.
Instead of plotting the standardized residuals against the leverage
and contours for Cook's distance, it plots Cook's distance against
the leverage and contours for standardized residuals.

Figure 4.18: Model diagnostic plots for fit of a linear model to cherry tree data.

The "Residuals vs. Fitted" plot from Figure 4.18 suggests that there
is some kind of structure in the data that is not captured
by the multiple linear relationship between the volume and the girth
and height. The standardized residuals follow a straight line in the
normal q-q plot which suggests that the standardized residuals follow
a normal distribution, and the scale-location plot is fairly constant and
does not suggest variance heterogeneity. None of the observations have
particularly high values of Cook's distance although observation 31 has
a relatively large value of Cook's distance. This is caused by the fact that
the tree girth and height of observation 31 are both larger than those of any of
the other observations which is why the leverage of that observation is
relatively high. The "Residuals vs. Leverage" and "Cook's distance vs.
Leverage" plots both show that observation 31 is relatively influential
but that the standardized residuals generally are fairly constant with
increasing leverage.
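
The numbers behind the influence plots can be extracted directly. The following sketch uses the base functions cooks.distance, hatvalues and rstandard on the same fit; the names cd, lev and top are ours.

```r
result <- lm(Volume ~ Height + Girth, data = trees)

cd  <- cooks.distance(result)   # influence of each observation
lev <- hatvalues(result)        # leverage of each observation

# Observations with the three largest Cook's distances
head(sort(cd, decreasing = TRUE), 3)

# Leverage and standardized residual of the most influential one
top <- which.max(cd)
c(leverage = unname(lev[top]),
  std.resid = unname(rstandard(result)[top]))
```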

The standardized residual plot plots the standardized residuals against
the fitted values (and usually also against each of the explanatory variables)
and it is another common plot used for graphical model validation.
Ideally, the standardized residuals should be randomly dispersed
around the horizontal axis with approximately 95% of the standardized
residuals in the interval [-1.96, 1.96]. We should look for variance
homogeneity (even spread of observations for all fitted values),
the overall structure (the average should be centered around zero), and
possible observations with a very large standardized residual just as
described above. Plots of the standardized residuals against each of
the explanatory variables serve to check if the explanatory variables
enter the model correctly. This is especially useful for quantitative
explanatory variables, where any systematic structure in a residual vs.
explanatory variable plot suggests that the explanatory variable should
enter the model in a more complex way.

Figure 4.19: Standardized residual plot.


We can plot a standardized residual plot directly using the following
code (where we only include a plot for one of the two explanatory variables):

> plot(fitted(result), rstandard(result), 

+ xlab="Fitted values", ylab="Standardized residuals") 

> plot(trees$Height, rstandard(result), 

+ xlab="Height", ylab="Standardized residuals") 

The standardized residual plot in Figure 4.19 shows the same overall
result as the "Residuals vs. Fitted" plot from Figure 4.18 above. The
standardized residual vs. height plot in Figure 4.19 has the standardized
residuals randomly spread around the horizontal axis so there is
nothing that suggests that height needs to enter the model in a more
complex way than linear.
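
As a sketch of what rstandard computes, the standardized residuals can be reproduced from the raw residuals, the residual standard deviation and the leverages. The last line gives the share of standardized residuals inside [-1.96, 1.96]. The names s and r.manual are ours.

```r
result <- lm(Volume ~ Height + Girth, data = trees)

# Standardized residual = residual / (sigma * sqrt(1 - leverage))
s <- summary(result)$sigma
r.manual <- residuals(result) / (s * sqrt(1 - hatvalues(result)))
all.equal(r.manual, rstandard(result))

# Share of standardized residuals inside [-1.96, 1.96]
mean(abs(rstandard(result)) <= 1.96)
```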


More advanced graphics 


4.19 Create a broken axis to indicate discontinuity 

Problem: You want to include a break mark on your axis to indicate a 
discontinuity. 

Solution: The axis.break function from the plotrix package can add break
marks on the axis of an existing plot. There are two different break mark
styles (the default "slash", and "zigzag") that are set with the style
option. The axis, position of the break, and relative width of the break
are controlled with the axis, breakpos, and brw options, respectively.

Figure 4.20: Example of axis.break output.

> library(plotrix) 

> plot(2:6) 

> # Default slash break point on the x axis 

> axis.break() 

> # Zigzag break point at 3.5 on y axis, set width 

> # to 0.05 of plot width 

> axis.break(axis=2, breakpos=3.5, brw=0.05, style="zigzag")

The output is seen in Figure 4.20. By default, axis.break creates the
break point on the x axis (axis=1) close to the y axis. Due to R's default
ink-on-paper printing approach the break mark is created by first
printing a bit of background color and then subsequently printing the
break mark. That also means that if you use a non-default background
color for your plots you should use the bgcol option in the call to
axis.break to set the correct background color.


4.20 Create a plot with two y-axes

Problem: You want to create a plot with two different y-axes.

Solution: While it is easy to create a plot in R that has two different y
axes it is generally not recommended to use two y axes as it is too easy to
mislead the viewer. The idea is to superimpose the second graph corresponding
to axis two on the first plot and then use the axis function
to add an extra axis.

Figure 4.21: Example of plotting two y-axes on a single graph.

We can allow a second plot on an existing graph by preventing R from
cleaning the frame and starting a new graphic with the par function.
Calling par(new=TRUE) after the first plot does exactly that and
ensures that the second plot is created on top of the first. The function
twoord.plot from the plotrix package automates the process
somewhat. As a minimum, twoord.plot should be given four arguments:
the x and y values for the first plot and the x and y values for the
second plot (corresponding to the options lx, ly, rx and ry, respectively).
The labels on the x axis, left y axis and right y axis are set by the
options xlab, ylab and rylab, respectively, and special labels on the
x axis (like dates) can be set using the xticklab option. The following
plot shows monthly Danish soft drink sales and traffic accidents in the
same graph, and the output is seen in Figure 4.21.


> library(plotrix) 

> # Enter data 

> softdrinks <- c(19.13, 15.94, 22.96, 28.70, 49.75, 76.53, 

+ 70.15, 89.29, 56.12, 44.01, 31.89, 25.51) 

> accidents <- c(557, 527, 566, 598, 631, 657, 

+ 640, 733, 724, 663, 678, 627) 

> twoord.plot(1:12, softdrinks, 1:12, accidents, xlab="Month", 

+ ylab="Danish soft drink sales (million liters)", 

+ rylab="Danish traffic accidents", 

+ xtickpos = 1:12, 

+ xticklab=month.name[1:12]) 
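
The manual approach with par(new=TRUE) and axis can be sketched with base graphics only. The margin sizes, plotting symbols and line types below are our own choices and not taken from twoord.plot.

```r
softdrinks <- c(19.13, 15.94, 22.96, 28.70, 49.75, 76.53,
                70.15, 89.29, 56.12, 44.01, 31.89, 25.51)
accidents <- c(557, 527, 566, 598, 631, 657,
               640, 733, 724, 663, 678, 627)

op <- par(mar = c(5, 4, 1, 4) + 0.1)      # room for the right-hand axis
plot(1:12, softdrinks, type = "b", xlab = "Month",
     ylab = "Danish soft drink sales (million liters)")
par(new = TRUE)                           # draw next plot on top of the first
plot(1:12, accidents, type = "b", pch = 2, lty = 2,
     axes = FALSE, xlab = "", ylab = "")  # no axes or labels this time
axis(4)                                   # second y axis on the right
mtext("Danish traffic accidents", side = 4, line = 3)
par(op)                                   # restore the margins
```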

See also: The doubleYScale function from the latticeExtra package
also allows for two y axes.


4.21 Rotate axis labels 

Problem: You want to make a plot where some of the axis labels are 
rotated. 

Solution: Text can be added to a graph using the text function and
we will use this function to manually add rotated axis labels. The margin
text function, mtext, which we would normally use to add text to
the margins of the current figure cannot be used for this as it does not
support the string rotation option srt.

By default, R assigns space around the figure region so it has room for
most labels. The amount of space can be changed with the mar option
to the par function, and we may need to provide extra space for the
rotated labels if they take up more space after rotation.

To add rotated axis labels, we first set the option xaxt="n" to suppress
the default axis labels when the graph is created. Then we use the axis
function with option labels=FALSE to add an axis with tick marks but
without labels before we finally add the labels to the plot using text.
When text is called we use the options srt to set the text rotation
angle, adj to set the horizontal and/or vertical adjustment of the label
text, and xpd=TRUE to prevent R from clipping the text at the plot
region. adj=1 places the right end of the text at the tick marks.
In the code below we use the optional argument y to place the labels.
par("usr")[3] returns the lower coordinate of the plotting region
and we use that to identify the location of the x axis. The value 0.1 can
then be adjusted to move the labels up or down relative to the x axis.


> # Set extra space for rotated labels 

> par(mar = c(7, 4, 1, 1) + 0.1) 

> # Create plot with no x axis and no x axis label 

> x <- 1:6 

> plot(x, sin(x), xaxt="n", xlab="", pch=1:6)

> axis(1, labels=FALSE)

> labels <- c("Are", "you", "master", "of", "your", "domain?")

> # Add labels at default tick marks

> text(x, y=par("usr")[3] - .1, srt=45, adj=1,

+ labels=labels, xpd=TRUE)

> # Add axis label at line 5 

> mtext(1, text="Added label for x-axis", line=5) 

Figure 4.22: Example of rotated labels on the x-axis.


The output from the above code can be seen in Figure 4.22. 

See also: The help page for par gives information on all the graphical 
parameters. 


4.22 Multiple plots 

Problem: You want to create a figure that consists of multiple sub-figures.
This can be used to make complex plot arrangements with
plotting regions of varying sizes.

Solution: The mfrow and mfcol options to par can partition the plotting
region into a number of equally sized rows and columns defined
by the values of mfrow or mfcol.
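
A minimal sketch of the mfrow option is shown below. The four plots and the simulated data are arbitrary, and the old settings are saved in op so they can be restored afterwards.

```r
op <- par(mfrow = c(2, 2))    # 2 x 2 grid, filled row by row
x <- exp(rnorm(100))          # simulated data
hist(x)
boxplot(x)
plot(density(x), main = "")
qqnorm(x)
par(op)                       # back to a single plotting region
```

Using mfcol=c(2, 2) instead would fill the same grid column by column.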

The layout function allows for more complicated partitions of the
plotting area. layout uses a matrix as input to imitate the total area
of the plotting device and determine how the area is partitioned. The
input matrix should contain integer values, and cells with the same
value are combined into a single plotting region. Values used to define
the plotting regions determine the order in which they are filled.
The layout.show function shows the current layout and the numbers
indicate the order in which plots are added to the graph. The following
layout prepares a plotting region that spans the entire width of the
device on top, a blank lower left area and a second plotting area to the
lower right that spans half the device width.


Figure 4.23: Examples of layout.show and layout output.


> design <- matrix(c(1,1,0,2), 2, 2, byrow=TRUE)
> design                          # Show the layout matrix
     [,1] [,2]
[1,]    1    1
[2,]    0    2
> layout1 <- layout(design)
> layout.show(layout1)
> x <- exp(rnorm(100))            # Simulate data
> y <- (rnorm(100)+3)**2          # Simulate more data
> plot(x, y)                      # Make a scatter plot
> hist(x)                         # and a histogram



The output is shown in the two plots found in Figure 4.23. Note how
the output from the layout.show function matches the matrix used
as input to the layout function. The first plot spans the entire width of
the plotting device by using the value 1 for both values in the first row
of the input layout matrix. More complicated patterns can be achieved
by clever selection of the matrix elements.

By default, layout assumes that all rows have the same height and all
columns have the same width. In principle, we could control the exact
size of the individual plotting regions by using a matrix with a large
number of rows and columns. However, the widths and heights options
allow us to control the relative widths and heights of the columns
and rows, respectively. In the following we make a 2 x 2 layout where
the top row fills 25% of the vertical plotting region and the right-hand
column fills 25% of the horizontal plotting region. The options widths
and heights require vectors with lengths that match the number of
columns and number of rows, respectively, and the vector values determine
the relative size of each column or row.

Figure 4.24: Examples of layout output where the relative heights and widths
of the rows and columns are controlled.


> layout2 <- layout(matrix(c(1, 0, 2, 3), nrow=2, byrow=TRUE),

+ widths=c(3, 1),

+ heights=c(1, 3))

> # Remove some of the white space around the plots 

> par(mar=c(4, 4.3, 0, 0) +.1) 

> plot(density(x), main="", xlab="") 

> plot(x, y) 

> boxplot(y) 

The result is seen in Figure 4.24. 
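The widths and heights options are often used to put small marginal
plots next to a main plot. The following is a minimal sketch of that
idea, not code from the examples above: the simulated data and the
choice of bar plots for the margins are assumptions made for
illustration, and pdf(NULL) is only used so the code also runs without
a screen device (remove it, together with dev.off, to see the result
interactively).

```r
pdf(NULL)                                  # Null device; no file is written
design <- matrix(c(2, 0,
                   1, 3), nrow=2, byrow=TRUE)
nfig <- layout(design, widths=c(3, 1), heights=c(1, 3))
x <- rnorm(100)                            # Simulated data
y <- rnorm(100)
xh <- hist(x, plot=FALSE)                  # Bin counts only, no plotting
yh <- hist(y, plot=FALSE)
par(mar=c(4, 4, 1, 1))
plot(x, y)                                 # Region 1: lower left
par(mar=c(0, 4, 1, 1))
barplot(xh$counts, axes=FALSE, space=0)    # Region 2: upper left
par(mar=c(4, 0, 1, 1))
barplot(yh$counts, axes=FALSE, space=0, horiz=TRUE)  # Region 3: lower right
dev.off()
```

The value returned by layout is the number of figures in the layout,
which is a quick way to check that the matrix describes the intended
number of plotting regions.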


4.23 Add a legend to a plot 

Problem: You have created a graph and want to add a legend to the 
plot. 

Solution: The legend function adds a legend to an existing plot. The 
x and y arguments determine the position of the (upper left corner of 
the) legend and the legend option is a character or expression vector 
that contains the text for the legend. 

If any of the usual graphics parameters lty or pch are supplied, then
a line or plotting symbol is placed in front of each line of the legend
text, respectively. Likewise, col and lwd modify the color and the line
width, respectively.

The following code plots the CO2 uptake profile as a function of CO2
concentration for each of 12 plants from two different areas and adds a
legend to the plot. The output is seen in Figure 4.25.

Figure 4.25: Example of legend function output.

> data(CO2)
> matplot(matrix(CO2$conc, ncol=12), matrix(CO2$uptake, ncol=12),
+         type="l", lty=rep(c(1,2), c(6,6)),
+         xlab="Concentration", ylab="Uptake")
> legend(750, 12, c("Quebec", "Mississippi"), lty=c(1,2))

Several other options exist that can change the appearance of the
legend. For example, setting horiz=TRUE sets the legend horizontally
instead of vertically.
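As a small illustration of these options, the sketch below mixes a
line and a plotting symbol in one horizontal legend. The data are made
up for the example, and the legend is positioned with the keyword
"topleft" rather than with explicit x and y coordinates, which legend
also accepts. pdf(NULL) is only there so the code runs without a
screen device.

```r
pdf(NULL)                                   # Null device; no file is written
x <- 1:10
plot(x, x^2, type="n", xlab="x", ylab="y")  # Set up an empty plot
lines(x, x^2, lty=1, col="black")           # Quadratic curve as a line
points(x, 10 * x, pch=16, col="red")        # Linear trend as points
# NA suppresses the line or the symbol for an individual legend entry
leg <- legend("topleft", legend=c("Quadratic", "Linear"),
              lty=c(1, NA), pch=c(NA, 16),
              col=c("black", "red"), horiz=TRUE)
dev.off()
```

legend invisibly returns a list describing the legend box and the text
positions, which can be handy when other annotations should be aligned
with the legend.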

See also: Another example of matplot can be found in Rule 4.25. 


4.24 Add a table to a plot 

Problem: You have created a graphics plot and want to add a table to 
the plot. 

Solution: The addtable2plot function from the plotrix package 
can add a table to an existing plot. Data for the table can be a data frame 
or matrix and the table is added in the same way a legend is added to a 
plot. 

The coordinates of the table location are given as the parameters x and
y, and the parameters bty, bg, and title set the box type surrounding
the table, the background color and the title, respectively. These three
options work the same way as they do for the plot function. If the
option display.rownames is set to TRUE then rownames are displayed
in the table. The output from the following code is seen in Figure 4.26.

Figure 4.26: Example of addtable2plot output.

> library(plotrix) 

> cats <- matrix(c(53, 115, 17, 502, 410, 14),ncol=2) 

> rownames(cats) <- c("Yes", "No", "No info") 

> colnames(cats) <- c("Problems", "No problems") 

> cats
        Problems No problems
Yes           53         502
No           115         410
No info       17          14
> # Calculate the group-wise relative frequencies
> cattable <- cats / apply(cats, 1, sum)
> barplot(t(cattable), beside=TRUE)
> addtable2plot(4.6, .8, cats, display.rownames=TRUE, bty="o",
+               bg="lightgray", title="Cat behaviour data")
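The group-wise relative frequencies computed above can also be
obtained with the base R function prop.table. The following sketch
rebuilds the cats matrix so that it is self-contained and checks that
the two approaches agree; using prop.table here is an alternative
suggested for illustration, not the approach taken in the example.

```r
cats <- matrix(c(53, 115, 17, 502, 410, 14), ncol=2,
               dimnames=list(c("Yes", "No", "No info"),
                             c("Problems", "No problems")))
# Division recycles the row sums down each column (row-wise scaling)
cattable <- cats / apply(cats, 1, sum)
# prop.table with margin=1 divides each row by its row total
alt <- prop.table(cats, margin=1)
print(round(alt, 3))
print(all.equal(cattable, alt))
```

Each row of the resulting table sums to one, so the bars for each
group in the bar plot show proportions within that group.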


4.25 Label points in a scatter plot 

Problem: You want to make a graph where you label points in a scatter 
plot. 

Solution: Individual labels can be put on a plot using the text
function. If you wish to change the plotting symbol and/or color
according to a grouping variable, then we can do that directly through
the parameters pch and col/bg, respectively. In particular, we set pch,
col, or bg to a vector of symbols and colors and use the grouping
variable as index to the vectors.

Figure 4.27: Individual symbols, labels and colors on points.

The CO2 data frame contains information on ambient carbon dioxide
concentrations and carbon dioxide uptake for two origins of the grass
species Echinochloa crus-galli. The Type variable provides the two
origins: Quebec and Mississippi. The following code produces a graph
where observations from the two origins are labeled as black or white
squares. The output can be seen in the left panel of Figure 4.27.

> data(CO2)
> plot(CO2$conc, CO2$uptake, pch=22,
+      bg=c("black", "white")[as.numeric(CO2$Type)],
+      xlab="Concentration", ylab="Uptake")

Here we use the fact that Type is a factor and that as.numeric returns
the index of the levels. If the variable that determines the groups used
to set the plotting symbols is not a factor, then we might run into
problems. To circumvent this, we can always start by first converting
the grouping variable to a factor and then converting it back using
as.numeric.
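The conversion can be sketched as follows for a character grouping
variable. The vector grp and the three colors are made up for the
example; the point is only that factor assigns level indices that can
be used to look up symbols or colors.

```r
grp <- c("b", "a", "c", "a", "b")           # Hypothetical character groups
# as.numeric(grp) would give NA values (with a warning), so convert
# to a factor first; the levels are sorted, so a=1, b=2, c=3
idx <- as.numeric(factor(grp))
cols <- c("black", "white", "gray")[idx]    # One color per observation
print(idx)
print(cols)
```

The resulting vector cols has one entry per observation and can be
passed directly as the col or bg argument to plot.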

In the CO2 dataset there is also information on the treatment, that is,
whether or not the grass was chilled. In the code example below we use
the interaction function to create a vector that gives combined
information on origin and treatment. This result is automatically a
factor, so we can extract the factor level indices and use those to plot
the origin in black or white and the treatment using either circles
(non-chilled) or squares (chilled). The resulting plot is found in the
right panel of Figure 4.27.

> origintreat <- interaction(CO2$Treatment, CO2$Type)
> head(origintreat)
[1] nonchilled.Quebec nonchilled.Quebec nonchilled.Quebec
[4] nonchilled.Quebec nonchilled.Quebec nonchilled.Quebec
4 Levels: nonchilled.Quebec ... chilled.Mississippi
> levels(origintreat)
[1] "nonchilled.Quebec"      "chilled.Quebec"
[3] "nonchilled.Mississippi" "chilled.Mississippi"
> plotcode <- as.numeric(origintreat)
> plot(CO2$conc, CO2$uptake,
+      pch=c(21, 22, 21, 22)[plotcode],
+      bg=c("black", "black", "white", "white")[plotcode],
+      xlab="Concentration", ylab="Uptake")
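Because the symbol and color vectors are indexed by the factor levels,
the same vectors can be reused to build a matching legend. The sketch
below repeats the plot and adds such a legend; the legend position and
the use of pdf(NULL) (so the code runs without a screen device) are
choices made for this illustration only.

```r
pdf(NULL)                                   # Null device; no file is written
origintreat <- interaction(CO2$Treatment, CO2$Type)
plotcode <- as.numeric(origintreat)
pchs <- c(21, 22, 21, 22)                   # Circle/square per treatment
bgs <- c("black", "black", "white", "white")  # Fill color per origin
plot(CO2$conc, CO2$uptake, pch=pchs[plotcode], bg=bgs[plotcode],
     xlab="Concentration", ylab="Uptake")
# One legend entry per factor level, in the same order as the levels
legend("topleft", legend=levels(origintreat), pch=pchs, pt.bg=bgs)
dev.off()
```

Note that the fill color of the symbols in the legend is set with
pt.bg, since bg refers to the background of the legend box itself.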


See also: The ggplot2 package extends the base plotting system in R 
and can easily change labels according to groups or variables. 


4.26 Identify points in a scatter plot 

Problem: You want to identify individual points in a scatter plot. 

Solution: The identify function can be used to interactively identify
individual points in an existing graphics device. It requires either the
coordinates of the points in the scatter plot as the first two input
arguments, x and y, or simply any R object which defines coordinates
like xy.coords.

By default, identify registers the position of the graphics pointer
when the mouse button is pressed and searches for the closest point
as defined by the tolerance option. If any such point is found on the
plot, then the observation number is printed on the plot. Plotting the
observation number can be omitted by setting plot=FALSE.

identify is terminated by pressing any mouse button other than the
first or by pressing the ESC key, depending on the graphics device.
When identify terminates, it prints the indices of points that were
identified while the function was running.

> x <- 1:10
> y <- x**2
> plot(x, y)
> identify(x, y)
[1] 3 9 10

Observations 3, 9, and 10 were identified in the example above. 

Instead of printing the index number for the closest observation, we
can use the labels option to set the labels to print for each point. For
example, if we wish to know if a point has an odd or even x value, then
we could use the following code:

> plot(x, y)

> group <- rep(c("Odd", "Even"), 5) 

> identify(x,y, labels=group) 

[1] 2 3 8 

The identify function is only supported on some graphics screen
devices and will do nothing on an unsupported device.

See also: The locator function can be used to print the position of the 
graphics cursor when the mouse button is pressed. 


4.27 Visualize points, shapes, and surfaces in 3D and interact
with them in real-time

Problem: You want to visualize data in 3D and be able to interactively 
turn, rotate, and zoom in and out on the plot. 

Solution: Three-dimensional images can be difficult to depict in two 
dimensions but the rgl package provides functionality that offers three- 
dimensional, real-time visualization of points, shapes and surfaces, and 
it allows the user to generate interactive 3D graphics to help visualize 
the plot. 

The rgl package uses OpenGL (the Open Graphics Library) for rendering
3D images in real-time and creates a graphics device which the
user can interact with. Since the graphics are rendered in real-time it is
possible to turn, scroll, rotate, and zoom in and out on the plot.

The rgl package has a number of functions for 3D plotting, and we will 
just illustrate a few of them here. To really appreciate the rgl package 
you should run the code and not just look at the screen dumps shown 
in Figure 4.28. 

The plot3d function creates a 3D scatter plot, and takes three numeric
vectors (corresponding to the values on the x, y, and z axes,
respectively) as the first three arguments. Like plot, plot3d can plot
points (use option type="s" for spherical points), lines (option
type="l"), or vertical lines relative to the x-y-plane (type="h"). The
size option sets the size of the plotted spheres, and col is used to set
the plotting color.

The following code plots the height, girth and volume variables from
the trees dataset, and it is possible to rotate the points to see the
nice relationship between the volume and the height and girth. Screen
dumps of the output are seen in the upper plots of Figure 4.28.

Figure 4.28: Examples of snapshots of output from the rgl package. Upper
plots are created by plot3d while the lower plots are made by surface3d.

> library(rgl) 

> data(trees) 

> attach(trees) 

> plot3d(Height, Girth, Volume, type="s", size=2) # Plot spheres 

> plot3d(Height, Girth, Volume, type="h") # Vertic. lines 

The persp3d function is similar to persp and can be used to create
3D surface plots. As usual, coloring of the surface can be set using
the col option but persp3d also has the front, back and smooth
options which are used to set the front and back polygon fill mode,
and whether shading is added to the rendering, respectively. front
and back default to "fill" which produces filled polygons, both
when viewed from "above" (the front) and "below" (the back). Other
possibilities are "line" for wireframed polygons, "point" for point
polygons, and "cull" for hidden polygons. When the smooth option
is set to TRUE then smooth shading is used for the rendering instead of
the default flat shading.

The volcano dataset contains topographical data about the Maunga 
Whau volcano and consists of a matrix of heights corresponding to a 
10 meter by 10 meter grid. The lower plots in Figure 4.28 show screen 
dumps from the following code. 

> data(volcano) 

> z <- volcano # Height 

> x <- 10 * (1:nrow(z))   # 10 meter spacing (S to N)

> y <- 10 * (1:ncol(z))   # 10 meter spacing (E to W)

> persp3d(x, y, z, smooth=TRUE, col="lightgray", 

+ xlab="NS axis", ylab="EW axis", zlab="Height") 

> persp3d(x, y, z, col="darkgray", front="line", back="cull", 

+ xlab="NS axis", ylab="EW axis", zlab="Height") 

Both plot3d and persp3d have an add option, which can be set to 
TRUE if points or surfaces should be added to the current plot instead 
of creating a new plot. In general, graphics parameters can be set or 
queried using the par3d function, which has the same functionality as 
the par function except it works with the functions in rgl. 

Graphical animations of 3D plots can be created by the movie3d or
play3d functions. The only difference between the two functions is
that movie3d saves the animation as a file (for example as an animated
GIF file) while play3d plays the animation directly on the screen. Both
functions work on the current rgl device and for input they require
a function that returns a list of viewpoints (readable by par3d) used
for the animation. The spin3d function is an example of a function
that provides viewpoints by spinning the current plot. spin3d takes
two arguments: axis, which is a vector of length 3 that contains the
desired axis of rotation, and rpm, which sets the rotation speed in
rotations per minute. When creating an animated file, the duration
option should be set to determine the duration in seconds. The output
location of the movie is set with the dir option.

The following code creates an animated GIF file where the current plot 
rotates slowly around the z axis. 

> movie3d(spin3d(axis=c(0,0,1), rpm=3), duration=60, dir=".") 

See also: The decorate3d function is used to add boxes, labels and
axes to plots in rgl. Screen shots can be exported from an rgl graphics
device using the rgl.postscript function which can export in
different formats. rgl.light can be used to add light sources to the
rendering.


4.28 Exporting graphics

Problem: You have made a nice graph and want to export it so you 
can use it in a manuscript or elsewhere. 

Solution: R is extremely useful for producing high-quality graphics. 
Here we focus on how to save plots as pdf files so they can be included 
in other documents. R can export to several other graphics formats but 
we will mainly focus on the pdf file format here because it is easily used 
under most operating systems. 

Base R plotting uses an ink-on-paper approach where each graphics
command either creates a new plot or adds to the current plot. Once
you have finished a complete plot, you can save the current graphics
device to a pdf file using the dev.copy2pdf function. If no filename is
specified, then a file called Rplot.pdf is created in the current
directory.

> x <- 1:10

> y <- rnorm(10)

> plot(x,y) # Create a scatter plot 

> title("Very lovely graphics") # Add title 

> dev.copy2pdf(file="lovelygraph.pdf") # Copy graph to pdf file 

The width and height of the output file created by dev.copy2pdf
are taken from the current device unless otherwise specified with the
width or height options.

The dev.print function copies the graphics contents of the current
device to a new device. Hence, we can save the current graphics device
in a different format by adding the relevant device option when calling
dev.print. The following lines show how to create other graphical
formats. Both png and jpeg files measure the width and height in
pixels and the output size needs to be set by specifying at least one of
the width or height options.

> dev.print(file="pngfile.png", device=png, width=480) 

> dev.print(file="jpgfile.jpeg", device=jpeg, 

+ width=480, height=240) 

Note that dev.copy2pdf and dev.print copy the entire plotting device
region. This includes the background color, which is transparent
for most screen devices, and that can have undesirable consequences
for graphics formats like png that accommodate transparent colors.

As an alternative to dev.copy2pdf and dev.print, the plot may be
produced directly to a file, without printing it on the screen at all:

> pdf("gorgeousgraph.pdf") # Opens a pdf device 

> plot(x,y) # Create a scatter plot 

> title("Very lovely graphics") # Add title 

> dev.off() # Close device and file 

The pdf function opens a file as the current graphics device and adds
all graphics output to the file until the dev.off command closes the
file and ends the device. The width and height of the graphics are set
by the width and height options, respectively, and are measured in
inches (both default to 7 inches). The files are saved in the working
directory unless a path is given in the file name.
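The following sketch shows the width and height options together with
an explicit path in the file name. The file name widegraph.pdf and the
use of the session's temporary directory are assumptions made for the
example; any writable location would do.

```r
# Write the file to the temporary directory of the current R session
outfile <- file.path(tempdir(), "widegraph.pdf")
pdf(outfile, width=8, height=4)       # 8 by 4 inches instead of 7 by 7
plot(1:10, (1:10)^2, xlab="x", ylab="y")
title("Wide format")
dev.off()                             # Close device and file
print(file.exists(outfile))
```

The file is only complete once dev.off has been called, so the device
should always be closed before the file is opened in another program.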

See also: The pdf function (as well as the other graphical device
functions) has additional options that control fonts, paper size, etc.
See the help file for more information. The functions png, jpeg, and
bmp are used to create Portable Network Graphics, Joint Photographic
Experts Group and bitmap image file formats, respectively. The
postscript function creates Postscript files and win.metafile can be
used to create a Windows metafile.


4.29 Produce graphics output in LaTeX-ready format

Problem: You want to produce graphics that are typeset directly in
LaTeX so the plots have a clean, unified look with the rest of the text.

Solution: If you are typesetting text with LaTeX you can use the TikZ
graphics language to typeset R graphics from within LaTeX such that the
plots have a clean, unified look with the remaining text.

The tikz function from the tikzDevice package opens a TikZ graphics
device and creates a file, Rplots.tex, that contains the necessary
TikZ code to be included in LaTeX. The default file name is changed
with the file option. Graphics output from tikz is typeset by LaTeX
and consequently will match whatever fonts are currently used in the
document.

The following code shows an example of how to create TikZ output,
where the relationship between ozone and temperature is modeled for
the airquality data frame.

> library(tikzDevice)
> data(airquality)
> attach(airquality)
> tikz(file="tikzexample.tex")     # Open tikz device
> plot(Temp, Ozone)                # Make scatter plot
> model <- lm(Ozone ~ Temp)        # Estimate lin. reg.
> abline(model, lwd=3, lty=2)      # Add line to plot
> # Add a lowess smoothed line to the plot as well
> # lowess requires complete cases only
> cc <- complete.cases(Temp, Ozone)
> lines(lowess(Temp[cc], Ozone[cc]), lwd=3, lty=1)
> # Add a legend
> legend(60, 150, c("Lowess", "Lin. reg."), lty=1:2, lwd=3)
> dev.off()                        # Close tikz device
> detach(airquality)

Figure 4.29: Rendering of the same graphics using base R graphics (left) and a
TikZ device in LaTeX (right).


The following lines show how to include the file produced by tikz in a
LaTeX document. Figure 4.29 shows an example of the graphics created
by the code above both as regular R graphics and as TikZ graphics.
Alternatively, if the option standAlone is set to TRUE then the output
produced by tikz will be wrapped in a complete LaTeX document that
can be compiled immediately.

\documentclass{article} 

\usepackage{tikz} % Necessary for tikz output 

\begin{document} 

Here goes some text. 

\begin{figure}[h] 

\begin{center} 

\input{tikzexample.tex} % Import output from tikz 

\end{center} 

\caption{Simple example} 
\label{fig:myfigure}

\end{figure} 

Some text after the figure. See 
Figure~\ref{fig:myfigure} to see my 
new integrated plot. 

\end{document} 

Note that it is necessary to have LaTeX and the TikZ package installed to
typeset the final graphical output. When tikzDevice loads, it searches
for the location of a LaTeX compiler in order to be able to query the
compiler for string widths and font metrics. If no LaTeX executable is
found then the path can be set manually with the global tikzLatex
option, e.g., options(tikzLatex="/path/to/latex/compiler").


4.30 Embed fonts in postscript or pdf graphics 

Problem: You have made a nice plot in pdf or postscript format and 
want to embed all fonts to ensure that it looks the same on all machines. 

Solution: If you wish to share your postscript or pdf graphics files and 
want to make sure that it will appear the same on all computers, you 
must make sure that all the necessary fonts are embedded in the file. 

By default R does not embed its fonts when producing pdf files but that
is generally not a problem if only the standard default fonts (e.g., Times,
Courier, Helvetica) are used since these standard fonts are part of the
postscript and pdf formats. However, if you use special fonts and want
to make sure that they are included in the postscript or pdf file, then the
embedFonts function can be used.

embedFonts takes an existing postscript or pdf file and embeds fonts
in the file. It takes the name of the existing file as first argument and
overwrites the existing file unless the outfile option is given to set the
name of the new graphics file with all the fonts embedded. embedFonts
relies on the external program Ghostscript to do the actual conversion,
so that program should be installed on the computer and the executable
should be in the search path so it can be found by R.

To embed the necessary fonts for the graphics file myplot.pdf located
in the current working directory, we type:

> embedFonts("myplot.pdf") 

Note that embedFonts does not embed the standard pdf fonts such
as Helvetica and Times in the graphics file since these fonts are
copyrighted and cannot be distributed legally without licensing. Since R
uses the Helvetica font by default, it may be necessary to use a free
font such as "Nimbus Sans" instead to make sure it is embedded. The
available pdf fonts can be seen with the pdfFonts function.

> names(pdfFonts())
 [1] "serif"                "sans"
 [3] "mono"                 "AvantGarde"
 [5] "Bookman"              "Courier"
 [7] "Helvetica"            "Helvetica-Narrow"
 [9] "NewCenturySchoolbook" "Palatino"
[11] "Times"                "URWGothic"
[13] "URWBookman"           "NimbusMon"
[15] "NimbusSan"            "URWHelvetica"
[17] "NimbusSanCond"        "CenturySch"
[19] "URWPalladio"          "NimbusRom"
[21] "URWTimes"             "Japan1"
[23] "Japan1HeiMin"         "Japan1GothicBBB"
[25] "Japan1Ryumin"         "Korea1"
[27] "Korea1deb"            "CNS1"
[29] "GB1"

Thus to make sure all fonts are embedded, we can use the following
code where the argument useDingbats is set to FALSE to prevent the
use of the copyrighted ZapfDingbats font for creating symbols.

> pdf(file="test.pdf", useDingbats=FALSE)
> par(family="NimbusSan")
> plot(1,1, pch=16)
> title("All embedded")
> dev.off()
> embedFonts("test.pdf")

Alternatively, the Cairo package can create a Cairo graphics device
which extends the default R graphics device and which automatically
includes fonts in postscript and pdf files. Cairo devices also allow for
extra graphical features not found in the base R graphics device.

The CairoPS and CairoPDF functions initialize new postscript and
pdf graphics devices in the same way as the postscript and pdf
functions, respectively. The following code shows that CairoPDF is
used just like pdf.

> library(Cairo)
> CairoPDF("mycairoplot.pdf")
> hist(rnorm(100))
> dev.off()


See also: The fonts option to pdf can be used to specify R graphics font
family names to include in the pdf file.


5.1 Getting help

Problem: You have run into a problem and are looking for help about 
a function or topic. 

Solution: There are several ways of getting help in R but sometimes
it may be difficult to find the answer to a particular question.
Knowledge of the different ways of using the help system greatly
improves the chance of locating the desired information.

The help function returns the documentation about a function or
dataset. It is the primary source of information about functions, their
syntax, arguments and the values returned.

The apropos function lists the command names that contain a specific
pattern if you do not know the exact name of the function.
Alternatively, help.search searches for a match in the name, alias,
title, concept or keyword entries in the help system and installed
packages, and the search is more thorough than apropos.

RSiteSearch performs a search of help pages, the R-help mailing list
archives, and vignettes and returns the result in a web browser.

Not all of the output from the calls below is shown. 

> help(mean) # Get help about the function mean 

> help("mean") # Same as above 

> ?mean # Same as above 

> help.start() # Start browser and the general help 


> apropos("mean")
 [1] "colMeans"        "kmeans"          "mean"
 [4] "mean.data.frame" "mean.Date"       "mean.default"
 [7] "mean.difftime"   "mean.POSIXct"    "mean.POSIXlt"
[10] "rowMeans"        "weighted.mean"


> help.search("mean")
Help files with alias or concept or title matching
'mean' using regular expression matching:

gplots::bandplot         Plot x-y Points with Locally Smoothed
                         Mean and Standard Deviation
gplots::plotmeans        Plot Group Means and Confidence
                         Intervals
base::DateTimeClasses    Date-Time Classes
base::Date               Date Class
base::colSums            Form Row and Column Sums and Means
base::difftime           Time Intervals
base::mean               Arithmetic Mean
boot::sunspot            Annual Mean Sunspot Numbers
lattice::tmd             Tukey Mean-Difference Plot
Matrix::Matrix-class     Virtual Class "Matrix" Class of
                         Matrices
Matrix::colSums          Form Row and Column Sums and Means
Matrix::dgeMatrix-class  Class "dgeMatrix" of Dense Numeric
                         (S4 Class) Matrices
rpart::meanvar           Mean-Variance Plot for an Rpart
                         Object
stats::kmeans            K-Means Clustering
stats::oneway.test       Test for Equal Means in a One-Way
                         Layout
stats::weighted.mean     Weighted Arithmetic Mean

> ??mean                 # Same as above
> RSiteSearch("sas2r")   # Search for the string sas2r in online
                         # help manuals and archived mailing lists
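The pattern given to apropos is a regular expression, so the search
can be narrowed when a plain substring returns too many names. The
sketch below anchors the pattern and restricts the result to
functions; the particular patterns are chosen for illustration, and
the exact list returned depends on which packages are attached.

```r
# Names on the search path that start with "mean"
starts <- apropos("^mean")
# Only functions whose names end in "Means", matching case exactly
funs <- apropos("Means$", mode="function", ignore.case=FALSE)
print(starts)
print(funs)
```

By default apropos ignores case, which is why the unanchored search
for "mean" shown earlier also returns names such as colMeans.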


The help.start function opens a browser window with access to
introductory and advanced manuals, FAQs, and reference materials.
Note that help.start uses a Java applet to initiate the browser-based
search engine and a compatible version of Java must be installed on
your system, and both Java and JavaScript need to be enabled in your
browser for this to function properly.

See also: The example function executes the example code associated 
with a function and can be used to run the examples. See Rule 5.7 for 
more information on vignettes. 


5.2 Finding R source code for a function

Problem: You want to see the R source code for a function. 

Solution: The source code for many R functions can be viewed simply 
by typing the function name at the command prompt. For example, to 
see the way the standard deviation function sd is defined we type 

> sd
function (x, na.rm = FALSE)
{
    if (is.matrix(x))
        apply(x, 2, sd, na.rm = na.rm)
    else if (is.vector(x))
        sqrt(var(x, na.rm = na.rm))
    else if (is.data.frame(x))
        sapply(x, sd, na.rm = na.rm)
    else sqrt(var(as.vector(x), na.rm = na.rm))
}
<environment: namespace:stats>


However, for generic functions like print, plot or boxplot where
the type of input object determines the way the function works, we
need to use the methods function to identify which methods are
available for the generic function. As shown below, we cannot see the
source code for the boxplot function directly, but the call to methods
shows which functions are called by boxplot.

> boxplot 

function (x, ...) 

UseMethod("boxplot") 

<environment: namespace:graphics> 

> methods(boxplot) 

[1] boxplot.default boxplot.formula* boxplot.matrix 
Non-visible functions are asterisked 


The source code for the individual functions can be obtained by typing
the function name directly but that only works for non-hidden
methods. Functions that are hidden inside a name space are shown with
a trailing asterisk and the getAnywhere function can extract objects
within all loaded name spaces.

> boxplot.matrix     # See source for non-hidden method
function (x, use.cols = TRUE, ...)
{
    groups <- if (use.cols)
        split(x, rep.int(1L:ncol(x), rep.int(nrow(x), ncol(x))))
    else split(x, seq(nrow(x)))
    if (length(nam <- dimnames(x)[[1 + use.cols]]))
        names(groups) <- nam
    invisible(boxplot(groups, ...))
}
<environment: namespace:graphics>

> boxplot.formula # Try to see source for hidden method 
Error: object 'boxplot.formula' not found 

> getAnywhere("boxplot.formula") 

A single object matching 'boxplot.formula' was found 
It was found in the following places 

registered S3 method for boxplot from namespace graphics 
namespace:graphics 
with value 

function (formula, data = NULL, ..., subset, na.action = NULL) 

{ 

if (missing(formula) || (length(formula) != 3L)) 
stop("'formula' missing or incorrect") 
m <- match.call(expand.dots = FALSE) 
if (is.matrix(eval(m$data, parent.frame()))) 
m$data <- as.data.frame(data) 
m$... <- NULL 
m$na.action <- na.action 
require(stats, quietly = TRUE) 
m[[1L]] <- as.name("model.frame")
mf <- eval(m, parent.frame()) 

response <- attr(attr(mf, "terms"), "response") 
boxplot(split(mf[[response]], mf[-response]), ...) 

} 

<environment: namespace:graphics> 

The output from getAnywhere tells you that the boxplot.formula method
is located in the graphics name space and gives you the source code.
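When the generic and the class are both known, the getS3method
function from the utils package is a more direct way to retrieve a
hidden S3 method. The sketch below fetches the same method as above
and inspects it; it is offered as a complement to getAnywhere rather
than as a replacement.

```r
# Retrieve the method for generic "boxplot" and class "formula"
f <- getS3method("boxplot", "formula")
print(is.function(f))
print(names(formals(f)))              # Argument names of the method
# getAnywhere also records where the object was found
g <- getAnywhere("boxplot.formula")
print(g$where)
```

getS3method returns the function itself, so typing f at the prompt
prints the source code in the same way as for a visible function.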

R has two internal systems for handling generic functions: the older S3
classes (as described above) and the newer S4 classes. Things are
slightly different for S4 classes, where the showMethods and
findMethods functions are used to list the available methods and
to get their source code, respectively.

> library(lme4) 

> refit # Try to show the source for the refit function 

standardGeneric for "refit" defined from package "lme4" 

function (object, newresp, ...) 
standardGeneric("refit") 

<environment: 0xa97c674> 

Methods may be defined for arguments: object, newresp 
Use showMethods("refit") for currently available ones. 

> showMethods("refit")   # Show available methods

Function: refit (package lme4)
object="mer", newresp="matrix"
object="mer", newresp="numeric"

> # List the source code for refit for class type numeric

> findMethods("refit", classes="numeric")

An object of class "listOfMethods"

$'mer#numeric' 

Method Definition: 

function (object, newresp, ...) 

{ 

newresp <- as.double(newresp[!is.na(newresp)]) 
stopifnot(length(newresp) == object@dims["n"]) 
object@y <- newresp 
mer_finalize(object) 

} 

<environment: namespace:lme4> 

Signatures: 

object newresp 
target "mer" "numeric" 
defined "mer" "numeric" 

Slot arguments: 

[1] "object" "newresp" 

Slot signatures: 

[[1]]

[1] "mer" "numeric" 

Slot generic: 

standardGeneric for "refit" defined from package "lme4" 

function (object, newresp, ...) 
standardGeneric("refit") 

<environment: 0xa55ebf4> 

Methods may be defined for arguments: object, newresp 
Use showMethods("refit") for currently available ones. 
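The same steps can be tried without installing lme4 by defining a
small S4 generic of our own. Everything in the sketch below (the
generic area, the class Circle and its slot r) is hypothetical and
only serves to show showMethods and getMethod at work on an S4
method.

```r
library(methods)
# A toy S4 generic, class and method (all names are made up)
setGeneric("area", function(shape, ...) standardGeneric("area"))
setClass("Circle", representation(r="numeric"))
setMethod("area", "Circle", function(shape, ...) pi * shape@r^2)

showMethods("area")                   # List the available methods
m <- getMethod("area", "Circle")      # Extract one method definition
print(m)                              # Shows the source of the method
a <- area(new("Circle", r=2))
print(a)
```

getMethod returns a single method for an exact signature, whereas
findMethods, as used above, returns a list of all methods that match
the given classes.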


R packages 


5.3 Installing R packages 

Problem: You want to install an R package. 

Solution: Apart from the functions that are available in R, there exists
a very large number of R packages that extend the usefulness of R. An
R package is essentially a collection of R functions and/or datasets.
Assume we want to install the foreign package which can be found
on CRAN. The command

> install.packages("foreign") # Install the foreign package 

downloads and installs the package from one of the CRAN mirrors
and will work for most installations. Installation of a package is only
required once, but you should make sure that each installed package is
updated.

Packages can be installed globally if R is run with administrator
privileges; otherwise, they must be installed locally. A package can be
installed in a local directory by setting the lib option to a directory
where you have write permissions, for example

> install.packages("foreign", lib="C:/my/private/R-packages/") 

The .libPaths function shows the locations where R currently looks
for packages. These locations can be used for the lib option to make
sure packages are installed in the same directory.

> .libPaths()   # Output from a machine running ubuntu

[1] "/home/user_4/ekstrom/R/x86_64-pc-linux-gnu-library/2.10"

[2] "/usr/local/lib/R/site-library" 

[3] "/usr/lib/R/site-library" 

[4] "/usr/lib/R/library" 
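Before choosing a directory for the lib option it can be useful to
check which of the library locations are writable for the current
user. The following sketch does this with the base function
file.access; the layout of the resulting data frame is just one
possible way to present the information.

```r
paths <- .libPaths()
# file.access returns 0 when the requested access (mode=2: write)
# is permitted and -1 otherwise
writable <- file.access(paths, mode=2) == 0
print(data.frame(path=paths, writable=writable))
```

A location marked as writable can be passed as the lib option to
install.packages without requiring administrator privileges.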

If you do not have Internet access or want to install a package that is
not part of CRAN, then use the repos option for install.packages
to tell R where the package can be found. If repos=NULL then R will
install from a local file on the computer.

> # Install a package from http://www.some.url.dk/Rfiles/
> install.packages("newRpackage",
+                  repos="http://www.some.url.dk/Rfiles/")
> # Install the newRpackage package from a local source file
> # located in the current working directory
> install.packages("newRpackage_1.0-1.tar.gz", repos=NULL)

The full path to a local package can be given, if it is not located in the 
current working directory. 

Packages may be distributed in source form or in compiled binary form.
The default is to use binary packages for Windows and Mac OS X and
source packages for unix-like systems. Source packages may contain
C/C++/Fortran code that requires the necessary tools to compile them
whereas binary packages are platform-specific and generally work right
out of the box. install.packages consults CRAN for available
packages in the default form, but the argument type="source" can be set
to force R to download source packages.

The functions and datasets in a package cannot be accessed before the
package is installed (and loaded in the current R session). The library
command loads a package in the current R session.

> library("foreign") # Loads the foreign package 

> library("isdals", lib.loc="C:/my/private/R-packages/") 

which should be run once in every R session for which you want to use 
functions from the package. 

See also: Rule 5.10 on how to automatically load packages when R starts.
Rule 5.4 shows how to update packages. Look at Rule 5.9 if you
permanently want R to search a local directory for installed packages so
you do not have to specify the lib option. In the Windows graphical
user interface a package can also be installed from the Packages ->
Install package(s) menu item. The Packages menu item can
also be used to set the repositories or to install a downloaded zip
package for Windows (use Packages -> Install package(s) from
local zip files).


5.4 Update installed R packages 

Problem: You want to update the installed R packages. 

Solution: Use update.packages to automatically search the R
repositories for newer versions of the installed packages and then download
and install the most recent versions on-the-fly.

> update.packages() 

You need administrator privileges to update globally installed
packages. Packages in a local library are updated if the lib.loc option
points to that library, so to check for new versions of the locally
installed packages we use the command

> update.packages(lib.loc="C:/my/private/R-packages/") 

Note that Windows may lock the files from a package if the package is
currently in use. Thus, under Windows you may have to start a new
R session where the packages have not been loaded before you run
update.packages.

See also: Look at Rule 5.9 if you permanently want R to search a
local directory for installed packages so you do not have to specify the
lib.loc option.


5.5 List the installed packages

Problem: You want to see a list of the installed packages. 

Solution: The library function lists the installed packages and the
location where each package is installed when library is
called without any arguments.


> library()
Packages in library '/usr/local/lib/R/site-library':

gplots    Various R programming tools for plotting data
isdals    Provides datasets for Introduction to
          Statistical Data Analysis for the Life Sciences

Packages in library '/usr/lib/R/site-library':

gdata     Various R programming tools for data
          manipulation
gtools    Various R programming tools
lme4      Linear mixed-effects models using S4 classes
RODBC     ODBC Database Access
snow      Simple Network of Workstations

Packages in library '/usr/lib/R/library':

base      The R Base Package
boot      Bootstrap R (S-Plus) Functions (Canty)


See also: The installed.packages function also lists the installed 
packages and their location and provides additional information on 
version number, dependencies, licenses, suggested additional packages, 
etc. 
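
As a small illustration of the installed.packages function mentioned above, the sketch below extracts a few columns from the character matrix it returns. The choice of columns is only an example; the object names ip and ip_df are made up for this illustration.

```r
ip <- installed.packages()          # character matrix, one row per package
head(ip[, c("Package", "LibPath", "Version")])

# A data frame is often more convenient for filtering
ip_df <- as.data.frame(ip[, c("Package", "LibPath", "Version", "Priority")],
                       stringsAsFactors = FALSE)

# Packages that are shipped as part of base R
subset(ip_df, Priority == "base")$Package
```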


5.6 List the content of a package 

Problem: You want to view the content and the description of an
installed package.

Solution: Use the package option with the help function to obtain 
information about an installed package and list its content. 

> help(package=foreign)

Description:

Package:            foreign
Priority:           recommended
Version:            0.8-40
Date:               2010-03-26
Title:              Read Data Stored by Minitab, S, SAS,
                    SPSS, Stata, Systat, dBase, ...
Depends:            R (>= 2.10.0), stats
Imports:            methods, utils
Maintainer:         R-core <R-core@r-project.org>
Author:             R-core members, Saikat DebRoy
                    <saikat@stat.wisc.edu>, Roger Bivand
                    <Roger.Bivand@nhh.no> and others: see
                    COPYRIGHTS file in the sources.
Description:        Functions for reading and writing data
                    stored by statistical packages such as
                    Minitab, S, SAS, SPSS, Stata, Systat,
                    ..., and for reading and writing dBase
                    files.
LazyLoad:           yes
License:            GPL (>= 2)
BugReports:         http://bugs.r-project.org
Packaged:           2010-03-28 15:53:20 UTC; ripley
Repository:         CRAN
Date/Publication:   2010-04-02 17:18:08
Built:              R 2.10.1; x86_64-pc-linux-gnu;
                    2010-05-16 20:34:23 UTC; unix

Index:

data.restore        Read an S3 Binary or data.dump File
lookup.xport        Lookup Information on a SAS XPORT Format
                    Library
read.arff           Read Data from ARFF Files
read.dbf            Read a DBF File
read.dta            Read Stata Binary Files
read.epiinfo        Read Epi Info Data Files
read.mtp            Read a Minitab Portable Worksheet
read.octave         Read Octave Text Data Files
read.spss           Read an SPSS Data File
read.ssd            Obtain a Data Frame from a SAS Permanent
                    Dataset, via read.xport
read.systat         Obtain a Data Frame from a Systat File
read.xport          Read a SAS XPORT Format Library
write.arff          Write Data into ARFF Files
write.dbf           Write a DBF File
write.dta           Write Files in Stata Binary Format
write.foreign       Write Text Files and Code to Read Them
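
The same information can also be retrieved as R objects rather than as a help page. The following is a minimal sketch using the stats package (chosen here only because it is always installed); packageDescription, packageVersion and getNamespaceExports are standard R functions.

```r
desc <- packageDescription("stats")        # parsed DESCRIPTION file
desc$Title                                 # a single field
as.character(packageVersion("stats"))      # version as a character string

exported <- getNamespaceExports("stats")   # names of the exported objects
head(sort(exported))
```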


5.7 List or view vignettes

Problem: You want to list the available vignettes or view a specific 
vignette. 

Solution: Vignettes are additional documents that accompany some R 
packages and provide additional information, user guides, examples or 
a general introduction to a topic or to the package. They are typically 
provided as pdf files (often also with the source file included) and it is 
possible to extract any R code so the user can run and experiment with 
the code. 

The vignette function shows or lists available vignettes. If no
arguments are given to vignette then it lists all vignettes from all
installed packages. The all argument can be set to FALSE to only list
vignettes from attached packages. If a character string is given as an
argument, then the vignette function will display the vignette
matching the string. The optional package argument can be supplied with a
character vector to restrict the packages that are searched through. The
edit function used on the object returned by the vignette function
shows the R source code from the vignette.

> vignette(all=FALSE) # List vignettes from attached packages
no vignettes found

Use 'vignette(all = TRUE)'
to list the vignettes in all *available* packages.

> vignette(all=TRUE) # List all installed vignettes
Vignettes in package 'coin':

LegoCondInf           A Lego System for Conditional
                      Inference (source, pdf)
MAXtest               Order-restricted Scores Test (source,
                      pdf)
coin                  coin: A Computational Framework for
                      Conditional Inference (source, pdf)
coin_implementation   Implementing a Class of Permutation
                      Tests: The coin Package (source, pdf)



> library(Matrix)
> vignette(all=FALSE, package="Matrix")
Vignettes in package 'Matrix':

Comparisons      Comparisons of Least Squares
                 calculation speeds (source, pdf)
Design-issues    Design Issues in Matrix package
                 Development (source, pdf)
Intro2Matrix     2nd Introduction to the Matrix
                 Package (source, pdf)
Introduction     Introduction to the Matrix Package
                 (source, pdf)
sparseModels     Sparse Model Matrices (source, pdf)

Use 'vignette(all = TRUE)'
to list the vignettes in all *available* packages.

> vignette("Intro2Matrix")          # Show vignette
> vign <- vignette("Intro2Matrix")  # Store vignette
> edit(vign)                        # Extract/edit source code



The edit function starts an editor window where the extracted R code 
can be viewed and altered. 
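
The vignette listing can also be treated as ordinary data. The sketch below assumes that the object returned by vignette stores its listing in a component called results, which is the case in current versions of R; the name vig is made up for this illustration.

```r
v <- vignette(all = TRUE)     # listing of the vignettes in all installed packages

# Keep the package name, the vignette name and its title
vig <- as.data.frame(v$results[, c("Package", "Item", "Title"), drop = FALSE],
                     stringsAsFactors = FALSE)
head(vig)
table(vig$Package)            # number of vignettes per package
```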


5.8 Install a package from BioConductor 

Problem: You want to install an R package that is part of the
BioConductor project.

Solution: The BioConductor project provides tools for the analysis 
and handling of high-throughput genomic data and contains a large 
number of R packages that cannot be found on CRAN. 

BioConductor packages are most easily installed using the biocLite
function from the biocLite.R installation script. We use the source
function to read the installation script directly from its location on the
Internet, and once it is read we have access to the biocLite function.

If we call biocLite without any arguments, then we will
automatically install around 20 common packages from the BioConductor project.
Alternatively, we can specify the desired package as the first argument.

> # Read the installation script directly from BioConductor site 

> source("http://bioconductor.org/biocLite.R") 

> biocLite("limma") # Install the limma package 

> biocLite() # Install the common BioConductor packages 
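
In more recent BioConductor releases the biocLite.R script has been retired, and packages are installed through the BiocManager package from CRAN instead. The following is a minimal sketch under that assumption; install_bioc is a hypothetical wrapper name, and the actual installation call is left commented out because it needs Internet access.

```r
# Hypothetical wrapper: make sure BiocManager is available, then use it
install_bioc <- function(pkgs = character()) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")   # BiocManager itself is on CRAN
  }
  BiocManager::install(pkgs)
}

# install_bioc("limma")   # uncomment to install the limma package
```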

The update.packages function can update installed packages from
the BioConductor project if the correct repository is set with the repos
option. The biocLite.R script provides the biocinstallRepos function,
which returns the correct URL for the BioConductor repository.


> source("http://bioconductor.org/biocLite.R") 

> update.packages(repos=biocinstallRepos(), ask=FALSE) 


5.9 Permanently change the default directory where R 
installs packages 

Problem: You want to permanently change the default directory where 
R installs packages so you do not have to manually specify the desired 
installation directory every time a new package is to be installed. 

Solution: If you do not have global/administrator rights on a
computer, then it can be desirable to permanently set up R to install
packages in a local directory.

The current site-wide directories used for installation can be identified
with the .libPaths function:

> .libPaths() # Example for Ubuntu installation
[1] "/home/user_4/ekstrom/R/x86_64-pc-linux-gnu-library/2.10"
[2] "/usr/local/lib/R/site-library"
[3] "/usr/lib/R/site-library"
[4] "/usr/lib/R/library"
> .libPaths() # Example for Windows 7 installation
[1] "C:/PROGRA~1/R/R-211~1.1-X/library"

The default location for package installation can be overridden by
setting the R_LIBS_USER environment variable. The value of an
environment variable can be seen from within R by using the Sys.getenv
function:

> Sys.getenv("R_LIBS_USER") # On machine running Ubuntu 

R_LIBS_USER 

"~/R/i686-pc-linux-gnu-library/2.11" 

> Sys.getenv("R_LIBS_USER") # On machine running WinXP 

R_LIBS_USER 

"C:\\Documents and Settings\\Claus\\Dokumenter/R/win-library/2.9" 

> Sys.getenv("R_LIBS_USER") # On machine running Mac OS X 

[1] "~/Library/R/2.13/library" 

The default path specified by R_LIBS_USER is only used if the
corresponding directory actually exists, so in order for the R_LIBS_USER
path to have any effect, the directory may have to be created manually.
The R_LIBS_USER environment variable is set in the .Renviron file
and at startup, R searches for an .Renviron file in the current
directory or in the user's home directory in that order. Thus, if different
projects are kept in separate directories, then different environment
setups can be used by using a different .Renviron file in each directory.
Under Windows the .Renviron file can be placed in the current
directory or in the user's Documents directory.

For example, creating the .Renviron file with the following contents
adds the path ~/sandbox/Rlocal to the set of libraries.

R_LIBS_USER="~/sandbox/Rlocal"

When R is started the specified package library is included among the
libraries found by R (provided that the directory exists):

> .libPaths()
[1] "/home/claus/sandbox/Rlocal"
[2] "/usr/local/lib/R/site-library"
[3] "/usr/lib/R/site-library"
[4] "/usr/lib/R/library"

The R_LIBS_USER environmental variable can specify multiple library 
paths by providing a list of paths separated by colons (semicolons on 
Windows). 
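
A library directory can also be added for the current session only, without editing any startup files. The sketch below uses a directory under tempdir() as a stand-in for a local library such as ~/sandbox/Rlocal; the directory name is an assumption made for the example.

```r
# Stand-in for a local library directory such as ~/sandbox/Rlocal
local_lib <- file.path(tempdir(), "Rlocal")
dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)  # must exist

old_paths <- .libPaths()
.libPaths(c(local_lib, old_paths))   # prepend it for this session only
.libPaths()                          # the new directory is now listed first
```

The change is lost when R is closed, so the .Renviron approach remains the way to make it permanent.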


5.10 Automatically load a package when R starts 

Problem: You want to make sure that a package is loaded
automatically when R starts.

Solution: At startup, R searches for a site-wide startup profile file
and a user profile file (in that order). Code added to any of these files
is run when R launches so we can add the necessary commands in
the site-wide profile or user profile to automatically load a package.
Code added to the individual user profile file obviously only affects
that particular user, while changes to the site-wide file affect all users.
The locations of the site-wide and user profile files are determined by
the R_PROFILE and R_PROFILE_USER environment variables,
respectively. If R_PROFILE is unset, then the default site-wide profile file is
R_HOME/etc/Rprofile.site, where R_HOME points to the R
installation directory. R searches for a file .Rprofile in the current
directory or in the user's home directory in that order if the environment
variable R_PROFILE_USER is unset.

The following code can be added to, say, the .Rprofile file to have
the foreign package loaded for the current user when R is launched.

# Example of .Rprofile to have the foreign package loaded
# automatically when R launches
local({
  options(defaultPackages=c(getOption("defaultPackages"),
                            "foreign"))
})

Note that we call the getOption function to ensure that the default
packages from R are also loaded and then add any additional packages
in the call to options. Everything is wrapped in the local function
so that any code executed at startup does not clutter up the
workspace with temporary objects.

Use the following commands with the Sys.getenv function to
determine the relevant environment variable or to find the R_HOME folder.

> Sys.getenv("R_PROFILE") 

R_PROFILE 

> Sys.getenv("R_PROFILE_USER") 

R_PROFILE_USER 

> # Identify the installation directory 

> Sys.getenv("R_HOME") # Output from machine running Ubuntu 

R_HOME 

"/usr/lib/R" 

> Sys.getenv("R_HOME") # Output from machine running Windows XP 

R_HOME 

"C:\\Program Files\\R\\R-2.9.2\\" 

Hence, to make the foreign package load automatically for all users on a
machine running Windows XP we would add the necessary lines to the
file C:\Program Files\R\R-2.9.2\etc\Rprofile.site.

See also: See help(Startup) in R to get more information on the R
startup process.
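
The search order described in this rule can be written out as a short sketch that reports which profile files R would look for. It only mirrors the description above and is not how R itself implements the lookup; the variable names are made up for the example.

```r
# Site-wide profile: R_PROFILE if set, otherwise R_HOME/etc/Rprofile.site
site_profile <- Sys.getenv("R_PROFILE",
                           unset = file.path(R.home("etc"), "Rprofile.site"))

# User profile: R_PROFILE_USER if set, otherwise .Rprofile in the current
# directory or in the home directory (in that order)
user_profile <- Sys.getenv("R_PROFILE_USER", unset = NA)
if (is.na(user_profile)) {
  candidates <- file.path(c(getwd(), path.expand("~")), ".Rprofile")
  user_profile <- candidates[file.exists(candidates)][1]  # NA if neither exists
}

site_profile
user_profile
getOption("defaultPackages")   # packages attached at startup
```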


The R workspace 


5.11 Managing the workspace 

Problem: You wish to list or delete objects in the R workspace.

Solution: The R workspace is your current working environment and
includes any user-defined objects: vectors, matrices, data frames, lists,
and functions.

The ls function lists the names of the existing objects in the current
workspace, and rm is used to remove objects.


> # Create two vectors
> x <- 1:5
> myfactor <- factor(c("a", "b", "b", "b", "c"))
> ls() # List local objects
[1] "myfactor" "x"
> x # Print x
[1] 1 2 3 4 5
> rm(x) # Remove object x
> x # Try to print x
Error: object 'x' not found
> ls()
[1] "myfactor"

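
Both ls and rm can work on several objects at once: ls accepts a pattern argument and rm accepts a character vector of names through its list argument. The sketch below uses a separate environment (called ws here, a made-up name) as a stand-in for the workspace so that running it does not delete anything from your own session.

```r
ws <- new.env()                          # stand-in for the workspace
assign("x", 1:5, envir = ws)
assign("x_tmp", 0, envir = ws)
assign("myfactor", factor(c("a", "b", "b", "b", "c")), envir = ws)

ls(envir = ws, pattern = "^x")           # objects whose names start with x

# Remove all objects matching the pattern in one call
rm(list = ls(envir = ws, pattern = "^x"), envir = ws)
remaining <- ls(envir = ws)
remaining
```

In the workspace itself the envir argument is simply left out, and rm(list=ls()) removes every object.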


5.12 Changing the current working directory 

Problem: You want to change or identify the current working
directory.

Solution: The functions getwd and setwd are used to get and set the 
current working directory of R. 

> getwd() # Get the current working directory 

[1] "C:/R" 

> setwd("d:/") # Set the current working directory to d:/ 
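
When the working directory is changed only temporarily, it is useful to know that setwd invisibly returns the previous directory. A minimal sketch, using tempdir() as an arbitrary example destination:

```r
old_dir <- setwd(tempdir())   # change directory and remember the old one
getwd()                       # now the temporary directory

setwd(old_dir)                # go back to where we started
getwd()
```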

Note that when specifying paths in R you should generally use the
forward slash '/' to indicate directories. This is standard on unix-like
systems but different from Windows, and using forward slashes makes
your R code portable to all operating systems. For example, R will
interpret the path c:/My Documents/mydata.txt correctly under
Windows. The backslash character '\' functions as an escape character
in R character strings and should be doubled if used in paths; i.e.,
c:\\My Documents\\mydata.txt.


5.13 Saving and loading workspaces

Problem: You wish to save the current workspace or load a saved 
workspace. 

Solution: The save.image function saves the current workspace and
load is used to load a saved workspace into the current session. If no
path is specified in either of the functions, then R saves and looks for the
workspace file in the current working directory.

> ls() # List the current objects 

character(0) 

> x <- 1:5 # Create a new vector 

> ls() # List the current objects 

[1] "x" 

> save.image("tmp/.RData") # Save the workspace to tmp/.RData 

> q() # Quit R 

When R is restarted, we use load to load the workspace. 

> ls() # List the current objects 

character(0) 

> load("tmp/.RData") # Load the saved image 

> ls() # List the current objects 

[1] "x" 

> x 

[1] 1 2 3 4 5 

The default file name for save.image is .RData and you need write
permissions for the relevant directory for save.image to work.

On startup, R automatically loads a saved workspace if it is found in 
the current working directory. Likewise, whenever you exit R you are 
asked if you wish to save the current workspace (to the current working 
directory). 

Note that a saved workspace only contains the saved objects — not the 
command history or the loaded packages. 

See also: Rule 5.12 shows how to change the current working directory. 
Rule 5.14 shows how to save and restore histories and the library 
function loads an installed package. 
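
If only some of the objects are worth keeping, the related save function stores selected objects instead of the whole workspace. The following is a minimal sketch; the temporary file and the environment called restored are choices made for the example, not requirements.

```r
x <- 1:5
y <- letters[1:3]

rdata_file <- tempfile(fileext = ".RData")  # temporary file for the example
save(x, y, file = rdata_file)               # save only x and y

restored <- new.env()                       # load into a separate environment
loaded_names <- load(rdata_file, envir = restored)
loaded_names                                # load returns the object names
restored$x
```

Loading into a separate environment avoids silently overwriting objects of the same name in the current workspace.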


5.14 Saving and loading histories 

Problem: You want to save (and subsequently load) the commands
you have issued in the current R session.

Solution: R allows the user to save (and reload) the complete session
command history as a text file. This allows the user to save the
commands for a later time (and possibly to use them again) and keep the
complete command history as documentation.

The savehistory function saves the complete R session history. The
command history is saved as a plain text file with the default name
.Rhistory. The file name and location can be set with the file option
and the location is relative to the current working directory.

> x <- 1:5
> y <- c(2, 3, 3, 1, 2)
> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
        2.8         -0.2

> savehistory("myanalysis.txt")

The file myanalysis.txt now contains the lines

x <- 1:5
y <- c(2, 3, 3, 1, 2)
lm(y ~ x)
savehistory("myanalysis.txt")

Note that savehistory only saves the commands used during the 
session and not the actual objects that were created. Thus, you still 
need to save a copy of any external dataset file if data were read from 
an external source. 

When R is restarted we read in a saved history with the loadhistory 
function. 

> loadhistory("myanalysis.txt") 

> x # x is not found 

Error: object 'x' not found 

loadhistory only reads in the list of saved commands but the actual 
commands are not rerun. They are only entered in the list of previous 
commands. 

To get R to read and evaluate all the commands from a file the source
function should be used. source accepts a file name, URL or
connection as argument and reads (and executes) input from that source.

> source("myanalysis.txt")
> ls()
[1] "x" "y"

See also: Rule 5.13 explains how to save the objects in the current
workspace.
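
The source function can also evaluate a file inside a separate environment, which keeps the sourced objects apart from the workspace. The sketch below writes a small script to a temporary file first so that it is self-contained; the file and the names script and analysis_env are made up for the example.

```r
# Write a small script to a temporary file
script <- tempfile(fileext = ".R")
writeLines(c("x <- 1:5",
             "y <- c(2, 3, 3, 1, 2)",
             "fit <- lm(y ~ x)"),
           script)

analysis_env <- new.env()
source(script, local = analysis_env)   # evaluate the commands in analysis_env

ls(analysis_env)                       # objects created by the script
coef(analysis_env$fit)
```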


5.15 Interact with the file system 

Problem: You wish to interact with the file system, for example to list
or rename files or to create new directories.

Solution: The dir function (alternatively the list.files function) 
lists the files in the current working directory. The path option can 
be set to change the directory listed, and the pattern option can be 
set to a regular expression such that only files which match the regular 
expression are listed. 

> dir() # List files in the current working directory 

[1] "addtable.r" "auc.r" "barplot2.r" "boxcox.r" 

[5] "break-mark.r" "ci.r" "cluster.r" "contour.r" 

[9] "gfx.r" "graphhelp.r" "heatmap.r" "label.r" 

[13] "layout.r" "lda.r" "legend.r" 

> dir(pattern="^b") # List files starting with b

[1] "barplot2.r" "boxcox.r" "break-mark.r" 


R has a large number of functions that provide an interface to the
computer's file system like file.copy, file.remove, file.rename,
file.append, and file.exists. The functions have the obvious
functionality indicated by their names, and they take one or two
character vectors containing file names or paths as input arguments. The
dir.create function creates a new directory and takes a character
vector containing a single path name as input.


> file.copy("boxcox.r", "boxcox2.r")      # Copy file to boxcox2.r
[1] TRUE
> file.rename("boxcox2.r", "newrcode.r")  # Rename to newrcode.r
[1] TRUE
> file.exists("newrcode.r")               # Check if file exists
[1] TRUE
> file.exists("boxcox2.r")                # Old file is gone
[1] FALSE
> file.remove("newrcode.r")               # Delete the file
[1] TRUE
> dir.create("newdir")                    # Make new directory



The basename and dirname functions extract the base (i.e., the
part after the final path separator) of a file name or path and the path
up to the last path separator, respectively.

> basename("d:/ont/mention/the/war.txt")
[1] "war.txt"
> dirname("d:/ont/mention/the/war.txt")
[1] "d:/ont/mention/the"

Figure 5.1: File selection dialogs for Mac OS X (left) and unix-like systems
(right).

See also: Rule 5.16 gives examples of the file.choose command. The
Sys.chmod function changes file permissions.
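
The pattern option of dir and list.files takes a regular expression, so ^ anchors the match at the start of the file name and $ at the end. The following sketch creates a few empty files in a scratch directory (a made-up location under tempdir()) to show the difference.

```r
# Scratch directory with a few empty example files
demo_dir <- file.path(tempdir(), "fsdemo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir,
                      c("barplot2.r", "boxcox.r", "ci.r", "notes.txt")))

b_files <- list.files(demo_dir, pattern = "^b")      # names starting with b
r_files <- list.files(demo_dir, pattern = "\\.r$")   # names ending in .r
b_files
r_files

# full.names=TRUE returns paths that can be passed on to other functions
list.files(demo_dir, pattern = "^b", full.names = TRUE)

unlink(demo_dir, recursive = TRUE)                   # clean up
```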


5.16 Locate and choose files interactively 

Problem: You want to interactively locate and choose a file. 

Solution: Several R functions (for example, those for importing data from
external sources or for exporting data or graphics) require an
argument that specifies the name of a file.

The file.choose function can be used in lieu of giving the file name in the
function call. Instead the user is asked to interactively provide the name
of the file. On unix-like systems, this will be a simple text input prompt,
where the user types in the file name. However, under Windows and
Mac OS X the file.choose function opens a graphical dialog box so
the user can choose the file interactively (see Figure 5.1).

The file.choose function returns a vector of selected file name(s) or
an error if none are chosen.

> mydata <- read.table(file.choose(), header=TRUE) 

There exist packages to provide the same graphical interactivity under
unix-like systems, but they are not a part of base R. The tcltk package
uses the Tcl language and the Tk toolkit to create platform-independent
widgets and it provides the tk_choose.files function which works
just like the Windows version of file.choose. It requires Tcl to be
installed on the computer, but that is usually the case for unix-like
machines.

> library(tcltk) 

> mydata <- read.table(tk_choose.files(), header=TRUE) 

See also: The gfile function from the gWidgets package also
provides a file selection dialog, but requires installation of some additional
graphic toolkit implementation.
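
Since file.choose needs a user at the keyboard, scripts that may also be run in batch mode can fall back to a fixed file name. A minimal sketch, where choose_file is a hypothetical helper and mydata.txt is an example file name:

```r
# Hypothetical helper: ask the user when R runs interactively,
# otherwise fall back to a default file name
choose_file <- function(default = NA_character_) {
  if (interactive()) file.choose() else default
}

path <- choose_file(default = "mydata.txt")
path
```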


5.17 Interact with the operating system 

Problem: You wish to interact with the operating system, for example 
to run system commands. 

Solution: The system function runs system commands directly under 
the operating system. The main argument to system is a character 
string with the system command to invoke. For example, under linux 
we can use the df command to report disk space usage. 

> system("df -h") # File system disk space usage (unix)
Filesystem          Size  Used Avail Use% Mounted on
/dev/sda1            47G  7,6G   37G  18% /
none                5,9G  332K  5,9G   1% /dev
none                5,9G  208K  5,9G   1% /dev/shm
none                5,9G  264K  5,9G   1% /var/run
none                5,9G     0  5,9G   0% /var/lock
none                5,9G     0  5,9G   0% /lib/init/rw
/dev/sda3           826G  4,0G  780G   1% /priv
joy:/priv_2/users   690G  534G  121G  82% /home/user_4
/dev/sr0             44M   44M     0 100% /media/cdrom0



Under Windows, a somewhat similar result is given by the fsutil 
command (provided you have administrator privileges). 

> system("fsutil volume diskfree c:") # Disk usage (Windows) 

Total # of free bytes : 535560192 

Total # of bytes : 10725732352 

Total # of avail free bytes : 535560192 

The output from the system function can be stored in R as a character
vector (one element for each output line produced by the system command) if
the intern=TRUE option is set.

> when <- system("date", intern=TRUE)
> when
[1] "Wed Mar 16 01:20:22 CET 2011"
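
The related system2 function takes the command and its arguments separately, which makes quoting easier. The sketch below calls the Rscript program that is shipped with R, so that the same command is available on all platforms; it assumes that Rscript is found in the directory returned by R.home("bin").

```r
rscript <- file.path(R.home("bin"), "Rscript")

# Capture the output of the command as a character vector
out <- system2(rscript,
               args = c("-e", shQuote("cat(1 + 1, fill = TRUE)")),
               stdout = TRUE)
out

# Without stdout=TRUE the exit status of the command is returned
status <- system2(rscript,
                  args = c("-e", shQuote('quit(save = "no", status = 3)')))
status
```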


Newcomers to R are often intimidated by the command-line interface, the
vast number of functions and packages, or the processes of importing 
data and performing a simple statistical analysis. The R Primer provides 
a collection of concise examples and solutions to R problems frequently 
encountered by new users of this statistical software. 


Rather than explore the many options available for every command as 
well as the ever-increasing number of packages, the book focuses on 
the basics of data preparation and analysis and gives examples that can 
be used as a starting point. The numerous examples illustrate a specific 
situation, topic, or problem, including data importing, data management, 
classical statistical analyses, and high-quality graphics production. Each 
example is self-contained and includes R code that can be run exactly as 
shown, enabling results from the book to be replicated. While base R is 
used throughout, other functions or packages are listed if they cover or 
extend the functionality. 




Features 

• Presents concise examples and solutions to common problems in R 

• Explains how to read and interpret output from statistical analyses 

• Covers importing data, data handling, and creating graphics 

• Requires a basic understanding of statistics 

After working through the examples found in this text, new users of R 
will be able to better handle data analysis and graphics applications in 
R. Additional topics and R code are available from the book’s supporting 
website. 


CRC Press 

Taylor & Francis Group


an informa business 
www.crcpress.com 

6000 Broken Sound Parkway, NW 
Suite 300, Boca Raton, FL 33487 
711 Third Avenue 
New York, NY 10017 
2 Park Square, Milton Park 
Abingdon, Oxon OX14 4RN, UK 

